
fix: Allow overlapping context scopes #2378

Merged

Conversation

bantonsson
Contributor

@bantonsson bantonsson commented Dec 3, 2024

Fixes #1887

Changes

This PR changes the single current Context to a more robust stack that is resilient to out-of-order and overlapping Context scopes. This robustness is needed to support better interop with tokio-rs/tracing.
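
For illustration, here is a minimal sketch of the overlapping-scope scenario this makes safe. It is not taken from the PR itself; it only uses the public `Context` API, and the assertions reflect the stack behavior described below.

```rust
use opentelemetry::Context;

#[derive(Debug, PartialEq)]
struct Tag(&'static str);

fn main() {
    // Two overlapping scopes: `outer` is entered first, `inner` second.
    let outer = Context::current_with_value(Tag("outer")).attach();
    let inner = Context::current_with_value(Tag("inner")).attach();

    // Drop the guards in the "wrong" order. With the previous single-slot
    // implementation this could restore a stale context; with the stack it
    // only leaves a hole that is cleaned up once the top of the stack pops.
    drop(outer);
    assert_eq!(Context::current().get::<Tag>(), Some(&Tag("inner")));

    drop(inner);
    assert_eq!(Context::current().get::<Tag>(), None);
}
```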

There is also an optimization based on the fact that the Map inside the Context is immutable. Below are the results for all context-related benchmarks.

| name | base duration | changes duration | difference % |
|------|---------------|------------------|--------------|
| context/has_active_span/in-cx/alt | 8.6±1.15ns | 8.4±0.15ns | -2.0 |
| context/has_active_span/in-cx/spec | 16.5±0.06ns | 16.5±0.05ns | 0.0 |
| context/has_active_span/no-cx/alt | 8.4±0.01ns | 8.4±0.02ns | 0.0 |
| context/has_active_span/no-cx/spec | 15.6±0.05ns | 15.2±0.07ns | -2.0 |
| context/has_active_span/no-sdk/alt | 8.4±0.02ns | 8.4±0.04ns | 0.0 |
| context/has_active_span/no-sdk/spec | 15.6±0.06ns | 15.2±0.05ns | -2.0 |
| context/is_recording/in-cx/alt | 4.7±0.05ns | 4.7±0.07ns | 0.0 |
| context/is_recording/in-cx/spec | 18.1±0.12ns | 17.8±0.13ns | -2.0 |
| context/is_recording/no-cx/alt | 4.7±0.05ns | 4.7±0.06ns | 0.0 |
| context/is_recording/no-cx/spec | 15.9±0.06ns | 15.9±0.12ns | 0.0 |
| context/is_recording/no-sdk/alt | 4.7±0.05ns | 4.7±0.04ns | 0.0 |
| context/is_recording/no-sdk/spec | 15.9±0.06ns | 15.9±0.08ns | 0.0 |
| context/is_sampled/in-cx/alt | 8.7±0.18ns | 8.7±0.02ns | 0.0 |
| context/is_sampled/in-cx/spec | 16.8±0.03ns | 16.5±0.04ns | -2.0 |
| context/is_sampled/no-cx/alt | 8.7±0.03ns | 8.7±0.06ns | 0.0 |
| context/is_sampled/no-cx/spec | 15.6±0.11ns | 15.3±0.14ns | -2.0 |
| context/is_sampled/no-sdk/alt | 8.7±0.03ns | 8.7±0.04ns | 0.0 |
| context/is_sampled/no-sdk/spec | 15.6±0.11ns | 15.2±0.07ns | -2.0 |
| context_attach/nested_cx/empty_cx | 48.6±0.31ns | 29.4±0.26ns | -39 |
| context_attach/nested_cx/single_value_cx | 96.9±1.66ns | 30.8±0.11ns | -68 |
| context_attach/nested_cx/span_cx | 55.7±0.21ns | 30.9±0.28ns | -44 |
| context_attach/out_of_order_cx_drop/empty_cx | 51.6±0.60ns | 38.0±0.27ns | -26 |
| context_attach/out_of_order_cx_drop/single_value_cx | 161.5±673.14ns | 39.5±0.79ns | -76 |
| context_attach/out_of_order_cx_drop/span_cx | 61.0±0.37ns | 39.6±0.29ns | -35 |
| context_attach/single_cx/empty_cx | 29.3±0.23ns | 16.1±0.09ns | -45 |
| context_attach/single_cx/single_value_cx | 43.5±1.09ns | 17.6±0.09ns | -60 |
| context_attach/single_cx/span_cx | 29.5±0.14ns | 17.2±0.12ns | -42 |

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@bantonsson bantonsson requested a review from a team as a code owner December 3, 2024 12:59

codecov bot commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 97.26027% with 4 lines in your changes missing coverage. Please review.

Project coverage is 79.6%. Comparing base (ff33638) to head (72f1e3a).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| opentelemetry/src/context.rs | 97.2% | 4 Missing ⚠️ |
Additional details and impacted files
@@          Coverage Diff           @@
##            main   #2378    +/-   ##
======================================
  Coverage   79.5%   79.6%            
======================================
  Files        123     123            
  Lines      22923   23041   +118     
======================================
+ Hits       18242   18356   +114     
- Misses      4681    4685     +4     


@bantonsson bantonsson force-pushed the ban/explore-context-stack branch from b64d904 to dd46520 Compare December 3, 2024 14:55
@bantonsson bantonsson force-pushed the ban/explore-context-stack branch 2 times, most recently from ab2f259 to 4502611 Compare January 27, 2025 11:03
@bantonsson bantonsson force-pushed the ban/explore-context-stack branch 5 times, most recently from 828ba76 to ea537b8 Compare February 5, 2025 12:06
@cijothomas cijothomas added this to the 0.29 milestone Feb 8, 2025
@cijothomas
Member

@bantonsson Can you add some details on how this fixes the issue of contexts being dropped out-of-order? Perhaps some code comments will help future readers.

@bantonsson
Contributor Author

@cijothomas I'll clean this up a bit more and add comments describing the ContextStack and how it solves the out of order closing of contexts.

@bantonsson bantonsson force-pushed the ban/explore-context-stack branch 2 times, most recently from ecb0279 to 94ddd7e Compare February 12, 2025 13:40
@bantonsson
Contributor Author

@cijothomas Do you know if the integration tests are flaky? The previous run failed but now it succeeded without any functional changes to the code.

@cijothomas
Member

@cijothomas Do you know if the integration tests are flaky? The previous run failed but now it succeeded without any functional changes to the code.

You can ignore it for now. Integration tests are not marked required for merging PRs. They do have some instability due to spinning up the OTel Collector in Docker, multiple tests using the same instance, etc. That will be addressed separately.

@shaun-cox
Contributor

Interesting. I too am curious to know a bit more about tokio-rs/tracing use cases that cause the context guards to be dropped out of order.

What is the expected behavior of the context stack if new contexts are pushed during out-of-order pops?

@bantonsson
Contributor Author

@shaun-cox So this is not for a normal use case, but to ensure that the Context can handle abuse by broken code. The expected behavior is that the current Context is the last entered and not exited Context. Any new push of a Context would be the new current Context.

The same type of mechanism exists in the Registry in tokio-rs/tracing-subscriber that has a stack as well.

This might be obvious but just to be clear, the pushing and popping of a Context is only about which Context is active on this thread at this time, and does not enforce any relationship between this and the previous Context. The relationships are created by the Span instances and their parent information.
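
For illustration, a minimal sketch of that expected behavior, assuming the public `Context` API and the stack semantics described here:

```rust
use opentelemetry::Context;

#[derive(Debug, PartialEq)]
struct Tag(&'static str);

fn current_tag() -> Option<&'static str> {
    Context::current().get::<Tag>().map(|t| t.0)
}

fn main() {
    let a = Context::current_with_value(Tag("a")).attach();
    let b = Context::current_with_value(Tag("b")).attach();

    // Out-of-order pop: "a" leaves the stack, "b" stays current.
    drop(a);
    assert_eq!(current_tag(), Some("b"));

    // A push during the out-of-order sequence simply becomes the new current.
    let c = Context::current_with_value(Tag("c")).attach();
    assert_eq!(current_tag(), Some("c"));

    // From here, in-order pops walk back down: "c" -> "b" -> root.
    drop(c);
    assert_eq!(current_tag(), Some("b"));
    drop(b);
    assert_eq!(current_tag(), None);
}
```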

@bantonsson bantonsson force-pushed the ban/explore-context-stack branch from 94ddd7e to c202910 Compare February 14, 2025 14:12
@shaun-cox
Contributor

@shaun-cox So this is not for a normal use case, but to ensure that the Context can handle abuse by broken code. The expected behavior is that the current Context is the last entered and not exited Context. Any new push of a Context would be the new current Context.

This might be obvious but just to be clear, the pushing and popping of a Context is only about which Context is active on this thread at this time, and does not enforce any relationship between this and the previous Context. The relationships are created by the Span instances and their parent information.

@bantonsson Thank you for the clarification... it is helpful. I was originally interpreting the associated bug and this PR as an indication that ContextGuards could be dropped in the wrong order by correctly functioning application code. My understanding now is that this change is a kind of "defense-in-depth" change to allow for buggy application code to keep executing? If that is the case, it kind of goes against the spirit of idiomatic Rust (IMO) which says that one should panic at the first sign of a bug, so that it can be fixed. Is it not sufficient to avoid this complexity and associated hiding of application bugs and just add a # Panics section to ContextGuard which explains that they will panic if dropped in the wrong order? (Assuming it will panic, or that it's relatively simple to make them panic if dropped in the wrong order?)

@cijothomas
Member

@shaun-cox So this is not for a normal use case, but to ensure that the Context can handle abuse by broken code. The expected behavior is that the current Context is the last entered and not exited Context. Any new push of a Context would be the new current Context.
This might be obvious but just to be clear, the pushing and popping of a Context is only about which Context is active on this thread at this time, and does not enforce any relationship between this and the previous Context. The relationships are created by the Span instances and their parent information.

@bantonsson Thank you for the clarification... it is helpful. I was originally interpreting the associated bug and this PR as an indication that ContextGuards could be dropped in the wrong order by correctly functioning application code. My understanding now is that this change is a kind of "defense-in-depth" change to allow for buggy application code to keep executing? If that is the case, it kind of goes against the spirit of idiomatic Rust (IMO) which says that one should panic at the first sign of a bug, so that it can be fixed. Is it not sufficient to avoid this complexity and associated hiding of application bugs and just add a # Panics section to ContextGuard which explains that they will panic if dropped in the wrong order? (Assuming it will panic, or that it's relatively simple to make them panic if dropped in the wrong order?)

The OTel spec explicitly prohibits crashing/panicking/throwing exceptions except at the initialization stage. However, this is probably a place where Rust idiomatic ways supersede the spec and we panic, as there is no "correct" way for us to resolve this user issue.

@bantonsson
Contributor Author

bantonsson commented Feb 17, 2025

@shaun-cox I agree that dropping the ContextGuard out of order is something that should be an error/panic, but the main purpose of this PR is to allow OpenTelemetry to provide better interoperability with tokio-rs/tracing. Since this code in tokio-rs/tracing-subscriber explicitly allows for out-of-order closing of spans, we need to accommodate that on the OpenTelemetry side as well. It would be very bad if adding an OpenTelemetry compatibility layer to a working application caused it to panic.

Maybe we should add this lenient mode behind a feature flag?

@mladedav

If that is the case, it kind of goes against the spirit of idiomatic Rust (IMO) which says that one should panic at the first sign of a bug

I don't think logging/tracing code should panic except during explicit calls in the setup phase.

and just add a # Panics section to ContextGuard

If possible, you should not panic in drop, which could make programs crash even when you try to catch unwinding panics. That documentation would also be harder to discover than other panics, since drops are implicit.
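
For example (a minimal sketch, not related to this PR's code): a panic inside Drop while another panic is already unwinding is a double panic and aborts the process, so even catch_unwind cannot contain it.

```rust
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        // Imagine this firing when a guard is dropped in the wrong order.
        panic!("guard dropped out of order");
    }
}

fn main() {
    let result = std::panic::catch_unwind(|| {
        let _g = Guard;
        // This panic starts unwinding, then `Guard::drop` panics during the
        // unwind: a double panic, which aborts the whole process.
        panic!("unrelated application bug");
    });
    // Never reached: the process aborts before `catch_unwind` can return.
    println!("{result:?}");
}
```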

If you decide that this is unnecessary complexity added just for broken programs, I think the current behavior (i.e. just broken spans but no crashes) with an additional otel_error! would be preferable.

@bantonsson
Contributor Author

I think that if we want to have any chance of real interoperability with tokio-rs/tracing we will need to handle out-of-order context dropping (preferably in the same way), at least if we are to believe this old ticket tokio-rs/tracing#495 and this old fix tokio-rs/tracing@e2a3897 that introduced this robustness behavior in tokio-rs/tracing.

@shaun-cox
Contributor

I might be missing the larger picture here too... what is meant by "real interoperability with tokio-rs/tracing"?

@bantonsson
Contributor Author

I might be missing the larger picture here too... what is meant by "real interoperability with tokio-rs/tracing"?

Thanks for pointing this out @shaun-cox. I see that my comment is lacking some context (pun intended).

Long story short, the existing tokio-rs/tracing-opentelemetry crate does not activate an OpenTelemetry Span to mimic the Tokio Span that is current on the thread, but instead carries around a SpanBuilder and only briefly starts/stops the OpenTelemetry Span when the Tokio Span is closed. This leads to unexpected behavior when mixing the two APIs, as described in #1690. I'm currently working on #2420 to fix this, but a proper fix depends on being resilient to out-of-order context scope deactivation due to the behavior in tokio-rs/tracing introduced in tokio-rs/tracing@e2a3897 to fix tokio-rs/tracing#495.

@bantonsson bantonsson force-pushed the ban/explore-context-stack branch 3 times, most recently from 7c0988e to ca2501e Compare February 26, 2025 15:22
@bantonsson
Contributor Author

Since the GitHub action can't post benchmark results on fork PRs and there is a metrics benchmark that fails, I'm just going to post the results for context_attach for now.

| name | base duration | changes duration | difference |
|------|---------------|------------------|------------|
| context_attach/nested_cx/empty_cx | 49.3±0.23ns | 27.9±0.25ns | -44 |
| context_attach/nested_cx/single_value_cx | 97.0±2.27ns | 29.6±0.35ns | -69 |
| context_attach/nested_cx/span_cx | 55.5±0.37ns | 29.3±0.29ns | -47 |
| context_attach/out_of_order_cx_drop/empty_cx | 51.1±0.39ns | 37.5±0.10ns | -26 |
| context_attach/out_of_order_cx_drop/single_value_cx | 164.2±699.46ns | 39.1±0.19ns | -76 |
| context_attach/out_of_order_cx_drop/span_cx | 61.0±0.30ns | 39.4±0.28ns | -35 |
| context_attach/single_cx/empty_cx | 29.2±0.07ns | 15.7±0.11ns | -46 |
| context_attach/single_cx/single_value_cx | 43.6±0.43ns | 17.2±0.10ns | -60 |
| context_attach/single_cx/span_cx | 29.4±0.18ns | 16.9±0.11ns | -43 |

@bantonsson bantonsson force-pushed the ban/explore-context-stack branch 2 times, most recently from 5fe5181 to 7725b37 Compare February 28, 2025 15:48
@bantonsson
Contributor Author

I've implemented the discussed behavior now.

Member

@cijothomas cijothomas left a comment

Looks good to me. There is generally a lack of tests for the Context area, but we should address that separately from this.
Context is not announced as RC/stable yet; we need to finish test coverage before that.

Also - I wonder if we could do the perf improvement in its own PR, separate from supporting out-of-order drops? Or is the perf gain strictly tied to supporting out-of-order drops? (Not a blocker)

@cijothomas
Member

@open-telemetry/rust-approvers PTAL. I could use 1-2 extra reviews before merging.

@cijothomas
Member

@bantonsson Could you update the PR description with the perf results from the most recent run? Also add a changelog entry for this.

@cijothomas
Member

@mladedav Could you review this please?

@bantonsson bantonsson force-pushed the ban/explore-context-stack branch 3 times, most recently from 0799947 to 20ecab7 Compare March 3, 2025 13:25
next_id as u16
} else {
// This is an overflow, log it and ignore it.
otel_warn!(
Member

@lalitb lalitb Mar 3, 2025

Would it make sense to attempt purging stale entries when the limit is reached before issuing a warning? Not for this PR, but possibly as a future enhancement.

Member

I think the current limit is already quite high, so the simple protection is good enough in my opinion.

Member

I agree that the limit is pretty high, but it would be better to still proactively clear out stale entries when it’s reached, especially for long-running applications. We could revisit this later as a future improvement after we gather more usage data or feedback. For now, I’m good with moving forward as-is. Thanks!

Contributor Author

There are no stale entries in the stack. When an entry is removed out of order, it is replaced by a None, so there can be holes in the stack. Since the id in the ContextGuard is the position in the stack, we can't resize it until the topmost element is removed.
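
For readers of this thread, a simplified model of that behavior (illustrative names only, not the actual opentelemetry internals):

```rust
// A placeholder standing in for the real Context type in this sketch.
struct Context(&'static str);

struct ContextStack {
    // Out-of-order pops leave `None` "holes" instead of shifting entries,
    // so the position handed out to each guard stays valid.
    stack: Vec<Option<Context>>,
}

impl ContextStack {
    fn push(&mut self, cx: Context) -> usize {
        self.stack.push(Some(cx));
        self.stack.len() - 1 // this position acts as the guard's id
    }

    fn pop(&mut self, id: usize) {
        if id + 1 == self.stack.len() {
            // Topmost entry: remove it, then shrink past any trailing holes.
            self.stack.pop();
            while matches!(self.stack.last(), Some(None)) {
                self.stack.pop();
            }
        } else if let Some(slot) = self.stack.get_mut(id) {
            // Out-of-order pop: leave a hole; the stack can't shrink yet.
            *slot = None;
        }
    }

    fn current(&self) -> Option<&Context> {
        // The current context is the last entered-and-not-exited one.
        self.stack.iter().rev().flatten().next()
    }
}

fn main() {
    let mut stack = ContextStack { stack: Vec::new() };
    let a = stack.push(Context("a"));
    let b = stack.push(Context("b"));
    stack.pop(a); // out of order: leaves a hole at position 0
    assert_eq!(stack.current().map(|cx| cx.0), Some("b"));
    stack.pop(b); // pops "b" and cleans up the trailing hole
    assert!(stack.current().is_none());
}
```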


@mladedav mladedav left a comment

Just small documentation nits but otherwise this looks good to me.

Comment on lines 399 to 400
/// [`ContextGuard`] instances that are constructed using it can't be shared with
/// other threads.

Nit, I think "can't be shared with other threads" written like this usually means !Sync while here the guards are just !Send.

I think changing it to "can't be moved to other threads" would be clearer.
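
For context (a hypothetical sketch, not the actual ContextGuard definition), the difference between the two marker traits: "can't be shared" reads as !Sync, while the guards are only !Send, i.e. they may be referenced from other threads but not moved to them.

```rust
use std::marker::PhantomData;
use std::sync::MutexGuard;

// Neither Send nor Sync: can't be moved to or shared with other threads.
struct NeitherSendNorSync {
    _marker: PhantomData<*const ()>,
}

// !Send but still Sync: references may be shared with other threads, but the
// value itself can't be moved to another thread (std's MutexGuard is !Send).
struct NotSendButSync {
    _marker: PhantomData<MutexGuard<'static, ()>>,
}

// Compile-time check documenting the claim above.
fn assert_sync<T: Sync>() {}

fn main() {
    assert_sync::<NotSendButSync>();
    // assert_sync::<NeitherSendNorSync>(); // would fail to compile
}
```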

Comment on lines 398 to 399
/// The stack relies on the fact that it is thread local and that the
/// [`ContextGuard`] instances that are constructed using it can't be shared with

This talks about guards constructed using the stack, but they're just stored there and not constructed using it? Or what does the "it" here refer to?

Contributor Author

Yes, that is not clear. This refers to the id in the guards coming from the use of the stack. I'll rephrase it.

Member

@lalitb lalitb left a comment

LGTM. It would also be helpful to have a design doc for the current Context implementation here: https://github.com/open-telemetry/opentelemetry-rust/tree/main/docs/design. Something to consider for a separate PR.

@bantonsson bantonsson force-pushed the ban/explore-context-stack branch from 20ecab7 to 08ec6e1 Compare March 4, 2025 08:10
@cijothomas cijothomas merged commit baf4bfd into open-telemetry:main Mar 5, 2025
23 checks passed