
fix: Allow overlapping context scopes #2378

Merged

Conversation

bantonsson
Contributor

@bantonsson bantonsson commented Dec 3, 2024

Fixes #1887

Changes

This PR changes the single current Context to a more robust stack that is resilient to out-of-order and overlapping Context scopes. This robustness is needed to support better interop with tokio-rs/tracing.
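
For illustration, here is a minimal sketch of the overlapping-scope scenario this makes safe. It is not taken from the PR itself; it only uses the public `Context` API, and the assertions reflect the stack behavior described below.

```rust
use opentelemetry::Context;

#[derive(Debug, PartialEq)]
struct Tag(&'static str);

fn main() {
    // Two overlapping scopes: `outer` is entered first, `inner` second.
    let outer = Context::current_with_value(Tag("outer")).attach();
    let inner = Context::current_with_value(Tag("inner")).attach();

    // Drop the guards in the "wrong" order. With the previous single-slot
    // implementation this could restore a stale context; with the stack it
    // only leaves a hole that is cleaned up once the top of the stack pops.
    drop(outer);
    assert_eq!(Context::current().get::<Tag>(), Some(&Tag("inner")));

    drop(inner);
    assert_eq!(Context::current().get::<Tag>(), None);
}
```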

There is also an optimization based on the fact that the Map inside the Context is immutable. Below are the results for all context-related benchmarks.

| name | base duration | changes duration | difference % |
|------|---------------|------------------|--------------|
| context/has_active_span/in-cx/alt | 8.6±1.15ns | 8.4±0.15ns | -2.0 |
| context/has_active_span/in-cx/spec | 16.5±0.06ns | 16.5±0.05ns | 0.0 |
| context/has_active_span/no-cx/alt | 8.4±0.01ns | 8.4±0.02ns | 0.0 |
| context/has_active_span/no-cx/spec | 15.6±0.05ns | 15.2±0.07ns | -2.0 |
| context/has_active_span/no-sdk/alt | 8.4±0.02ns | 8.4±0.04ns | 0.0 |
| context/has_active_span/no-sdk/spec | 15.6±0.06ns | 15.2±0.05ns | -2.0 |
| context/is_recording/in-cx/alt | 4.7±0.05ns | 4.7±0.07ns | 0.0 |
| context/is_recording/in-cx/spec | 18.1±0.12ns | 17.8±0.13ns | -2.0 |
| context/is_recording/no-cx/alt | 4.7±0.05ns | 4.7±0.06ns | 0.0 |
| context/is_recording/no-cx/spec | 15.9±0.06ns | 15.9±0.12ns | 0.0 |
| context/is_recording/no-sdk/alt | 4.7±0.05ns | 4.7±0.04ns | 0.0 |
| context/is_recording/no-sdk/spec | 15.9±0.06ns | 15.9±0.08ns | 0.0 |
| context/is_sampled/in-cx/alt | 8.7±0.18ns | 8.7±0.02ns | 0.0 |
| context/is_sampled/in-cx/spec | 16.8±0.03ns | 16.5±0.04ns | -2.0 |
| context/is_sampled/no-cx/alt | 8.7±0.03ns | 8.7±0.06ns | 0.0 |
| context/is_sampled/no-cx/spec | 15.6±0.11ns | 15.3±0.14ns | -2.0 |
| context/is_sampled/no-sdk/alt | 8.7±0.03ns | 8.7±0.04ns | 0.0 |
| context/is_sampled/no-sdk/spec | 15.6±0.11ns | 15.2±0.07ns | -2.0 |
| context_attach/nested_cx/empty_cx | 48.6±0.31ns | 29.4±0.26ns | -39 |
| context_attach/nested_cx/single_value_cx | 96.9±1.66ns | 30.8±0.11ns | -68 |
| context_attach/nested_cx/span_cx | 55.7±0.21ns | 30.9±0.28ns | -44 |
| context_attach/out_of_order_cx_drop/empty_cx | 51.6±0.60ns | 38.0±0.27ns | -26 |
| context_attach/out_of_order_cx_drop/single_value_cx | 161.5±673.14ns | 39.5±0.79ns | -76 |
| context_attach/out_of_order_cx_drop/span_cx | 61.0±0.37ns | 39.6±0.29ns | -35 |
| context_attach/single_cx/empty_cx | 29.3±0.23ns | 16.1±0.09ns | -45 |
| context_attach/single_cx/single_value_cx | 43.5±1.09ns | 17.6±0.09ns | -60 |
| context_attach/single_cx/span_cx | 29.5±0.14ns | 17.2±0.12ns | -42 |

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@bantonsson bantonsson requested a review from a team as a code owner December 3, 2024 12:59

codecov bot commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 97.26027% with 4 lines in your changes missing coverage. Please review.

Project coverage is 79.6%. Comparing base (ff33638) to head (72f1e3a).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| opentelemetry/src/context.rs | 97.2% | 4 Missing ⚠️ |
Additional details and impacted files
@@          Coverage Diff           @@
##            main   #2378    +/-   ##
======================================
  Coverage   79.5%   79.6%            
======================================
  Files        123     123            
  Lines      22923   23041   +118     
======================================
+ Hits       18242   18356   +114     
- Misses      4681    4685     +4     


@bantonsson bantonsson force-pushed the ban/explore-context-stack branch from b64d904 to dd46520 Compare December 3, 2024 14:55
@bantonsson bantonsson force-pushed the ban/explore-context-stack branch 2 times, most recently from ab2f259 to 4502611 Compare January 27, 2025 11:03
@bantonsson bantonsson force-pushed the ban/explore-context-stack branch 5 times, most recently from 828ba76 to ea537b8 Compare February 5, 2025 12:06
@cijothomas cijothomas added this to the 0.29 milestone Feb 8, 2025
@cijothomas
Member

@bantonsson Can you add some details on how this fixes the issue of contexts being dropped out-of-order? Perhaps some code comments will help future readers.

@bantonsson
Contributor Author

@cijothomas I'll clean this up a bit more and add comments describing the ContextStack and how it solves the out of order closing of contexts.

@bantonsson bantonsson force-pushed the ban/explore-context-stack branch 2 times, most recently from ecb0279 to 94ddd7e Compare February 12, 2025 13:40
@bantonsson
Contributor Author

@cijothomas Do you know if the integration tests are flaky? The previous run failed but now it succeeded without any functional changes to the code.

@cijothomas
Member

@cijothomas Do you know if the integration tests are flaky? The previous run failed but now it succeeded without any functional changes to the code.

You can ignore it for now. Integration tests are not marked required for merging PRs. They do have some instability due to spinning up the OTel Collector in Docker, multiple tests using the same instance, etc. That will be addressed separately.

@shaun-cox
Contributor

Interesting. I too am curious to know a bit more about tokio-rs/tracing use cases that cause the context guards to be dropped out of order.

What is the expected behavior of the context stack if new contexts are pushed during out-of-order pops?

@bantonsson
Contributor Author

@shaun-cox So this is not for a normal use case, but to ensure that the Context can handle abuse by broken code. The expected behavior is that the current Context is the last entered and not exited Context. Any new push of a Context would be the new current Context.

The same type of mechanism exists in the Registry in tokio-rs/tracing-subscriber that has a stack as well.

This might be obvious but just to be clear, the pushing and popping of a Context is only about which Context is active on this thread at this time, and does not enforce any relationship between this and the previous Context. The relationships are created by the Span instances and their parent information.
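
For illustration, a minimal sketch of that expected behavior, assuming the public `Context` API and the stack semantics described here:

```rust
use opentelemetry::Context;

#[derive(Debug, PartialEq)]
struct Tag(&'static str);

fn current_tag() -> Option<&'static str> {
    Context::current().get::<Tag>().map(|t| t.0)
}

fn main() {
    let a = Context::current_with_value(Tag("a")).attach();
    let b = Context::current_with_value(Tag("b")).attach();

    // Out-of-order pop: "a" leaves the stack, "b" stays current.
    drop(a);
    assert_eq!(current_tag(), Some("b"));

    // A push during the out-of-order sequence simply becomes the new current.
    let c = Context::current_with_value(Tag("c")).attach();
    assert_eq!(current_tag(), Some("c"));

    // From here, in-order pops walk back down: "c" -> "b" -> root.
    drop(c);
    assert_eq!(current_tag(), Some("b"));
    drop(b);
    assert_eq!(current_tag(), None);
}
```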

@bantonsson bantonsson force-pushed the ban/explore-context-stack branch from 94ddd7e to c202910 Compare February 14, 2025 14:12
@shaun-cox
Contributor

@shaun-cox So this is not for a normal use case, but to ensure that the Context can handle abuse by broken code. The expected behavior is that the current Context is the last entered and not exited Context. Any new push of a Context would be the new current Context.

This might be obvious but just to be clear, the pushing and popping of a Context is only about which Context is active on this thread at this time, and does not enforce any relationship between this and the previous Context. The relationships are created by the Span instances and their parent information.

@bantonsson Thank you for the clarification... it is helpful. I was originally interpreting the associated bug and this PR as an indication that ContextGuards could be dropped in the wrong order by correctly functioning application code. My understanding now is that this change is a kind of "defense-in-depth" change to allow for buggy application code to keep executing? If that is the case, it kind of goes against the spirit of idiomatic Rust (IMO) which says that one should panic at the first sign of a bug, so that it can be fixed. Is it not sufficient to avoid this complexity and associated hiding of application bugs and just add a # Panics section to ContextGuard which explains that they will panic if dropped in the wrong order? (Assuming it will panic, or that it's relatively simple to make them panic if dropped in the wrong order?)

@cijothomas
Member

@shaun-cox So this is not for a normal use case, but to ensure that the Context can handle abuse by broken code. The expected behavior is that the current Context is the last entered and not exited Context. Any new push of a Context would be the new current Context.
This might be obvious but just to be clear, the pushing and popping of a Context is only about which Context is active on this thread at this time, and does not enforce any relationship between this and the previous Context. The relationships are created by the Span instances and their parent information.

@bantonsson Thank you for the clarification... it is helpful. I was originally interpreting the associated bug and this PR as an indication that ContextGuards could be dropped in the wrong order by correctly functioning application code. My understanding now is that this change is a kind of "defense-in-depth" change to allow for buggy application code to keep executing? If that is the case, it kind of goes against the spirit of idiomatic Rust (IMO) which says that one should panic at the first sign of a bug, so that it can be fixed. Is it not sufficient to avoid this complexity and associated hiding of application bugs and just add a # Panics section to ContextGuard which explains that they will panic if dropped in the wrong order? (Assuming it will panic, or that it's relatively simple to make them panic if dropped in the wrong order?)

The OTel spec explicitly prohibits crashing/panicking/throwing exceptions except at the initialization stage. However, this is probably a place where Rust idiomatic ways supersede the spec and we panic, as there is no "correct" way for us to resolve this user issue.

@bantonsson
Contributor Author

bantonsson commented Feb 17, 2025

@shaun-cox I agree that dropping the ContextGuard out of order is something that should be an error/panic, but the main purpose of this PR is to allow OpenTelemetry to provide better interoperability with tokio-rs/tracing. Since this code in tokio-rs/tracing-subscriber explicitly allows for out-of-order closing of spans, we need to accommodate that on the OpenTelemetry side as well. It would be very bad if adding an OpenTelemetry compatibility layer to a working application caused it to panic.

Maybe we should add this lenient mode behind a feature flag?

@mladedav

If that is the case, it kind of goes against the spirit of idiomatic Rust (IMO) which says that one should panic at the first sign of a bug

I don't think logging/tracing code should panic except during explicit calls in the setup phase.

and just add a # Panics section to ContextGuard

If possible, you should not panic in drop, which could make programs crash even when you try to catch unwinding panics. That documentation would also be harder to discover than other panics, since drops are implicit.
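
For example (a minimal sketch, not related to this PR's code): a panic inside Drop while another panic is already unwinding is a double panic and aborts the process, so even catch_unwind cannot contain it.

```rust
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        // Imagine this firing when a guard is dropped in the wrong order.
        panic!("guard dropped out of order");
    }
}

fn main() {
    let result = std::panic::catch_unwind(|| {
        let _g = Guard;
        // This panic starts unwinding, then `Guard::drop` panics during the
        // unwind: a double panic, which aborts the whole process.
        panic!("unrelated application bug");
    });
    // Never reached: the process aborts before `catch_unwind` can return.
    println!("{result:?}");
}
```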

If you decide that this is unnecessary complexity added just for broken programs, I think the current behavior (i.e. just broken spans but no crashes) with an additional otel_error! would be preferable.

@bantonsson
Contributor Author

I think that if we want to have any chance of real interoperability with tokio-rs/tracing we will need to handle out-of-order context dropping (preferably in the same way), at least if we are to believe this old ticket tokio-rs/tracing#495 and this old fix tokio-rs/tracing@e2a3897 that introduced this robustness behavior in tokio-rs/tracing.

@shaun-cox
Contributor

I might be missing the larger picture here too... what is meant by "real interoperability with tokio-rs/tracing"?

@bantonsson
Contributor Author

I might be missing the larger picture here too... what is meant by "real interoperability with tokio-rs/tracing"?

Thanks for pointing this out @shaun-cox. I see that my comment is lacking some context (pun intended).

Long story short, the existing tokio-rs/tracing-opentelemetry crate does not activate an OpenTelemetry Span to mimic the Tokio Span that is current on the thread, but instead carries around a SpanBuilder and only briefly starts/stops the OpenTelemetry Span when the Tokio Span is closed. This leads to unexpected behavior when mixing the two APIs, as described in #1690. I'm currently working on #2420 to fix this, but a proper fix depends on being resilient to out-of-order context scope deactivation due to the behavior in tokio-rs/tracing introduced in tokio-rs/tracing@e2a3897 to fix tokio-rs/tracing#495.

@bantonsson bantonsson force-pushed the ban/explore-context-stack branch 3 times, most recently from 7c0988e to ca2501e Compare February 26, 2025 15:22
@bantonsson
Contributor Author

Since the GitHub action can't post benchmark results on fork PRs and there is a metrics benchmark that fails, I'm just going to post the results for context_attach for now.

| name | base duration | changes duration | difference |
|------|---------------|------------------|------------|
| context_attach/nested_cx/empty_cx | 49.3±0.23ns | 27.9±0.25ns | -44 |
| context_attach/nested_cx/single_value_cx | 97.0±2.27ns | 29.6±0.35ns | -69 |
| context_attach/nested_cx/span_cx | 55.5±0.37ns | 29.3±0.29ns | -47 |
| context_attach/out_of_order_cx_drop/empty_cx | 51.1±0.39ns | 37.5±0.10ns | -26 |
| context_attach/out_of_order_cx_drop/single_value_cx | 164.2±699.46ns | 39.1±0.19ns | -76 |
| context_attach/out_of_order_cx_drop/span_cx | 61.0±0.30ns | 39.4±0.28ns | -35 |
| context_attach/single_cx/empty_cx | 29.2±0.07ns | 15.7±0.11ns | -46 |
| context_attach/single_cx/single_value_cx | 43.6±0.43ns | 17.2±0.10ns | -60 |
| context_attach/single_cx/span_cx | 29.4±0.18ns | 16.9±0.11ns | -43 |

@bantonsson bantonsson force-pushed the ban/explore-context-stack branch 2 times, most recently from 5fe5181 to 7725b37 Compare February 28, 2025 15:48
@bantonsson
Contributor Author

I've implemented the discussed behavior now.

Member

@cijothomas cijothomas left a comment

Looks good to me. There is generally a lack of tests for the Context area, but we should address that separately from this.
Context is not announced as RC/stable yet; we need to finish test coverage before that.

Also - I wonder if we could do the perf improvement in its own PR, separate from supporting out-of-order drops? Or is the perf gain strictly tied to supporting out-of-order drops? (Not a blocker)

@cijothomas
Member

@open-telemetry/rust-approvers PTAL. I could use 1-2 extra reviews before merging.

@cijothomas
Member

@bantonsson Could you update the PR description with the perf results from the most recent run? Also add a changelog entry for this.

@cijothomas
Member

@mladedav Could you review this please?

@bantonsson bantonsson force-pushed the ban/explore-context-stack branch 3 times, most recently from 0799947 to 20ecab7 Compare March 3, 2025 13:25
next_id as u16
} else {
// This is an overflow, log it and ignore it.
otel_warn!(
Member

@lalitb lalitb Mar 3, 2025

Would it make sense to attempt purging stale entries when the limit is reached before issuing a warning? Not for this PR, but possibly as a future enhancement.

Member

I think the current limit is already quite high, so the simple protection is good enough in my opinion.

Member

I agree that the limit is pretty high, but it would be better to still proactively clear out stale entries when it’s reached, especially for long-running applications. We could revisit this later as a future improvement after we gather more usage data or feedback. For now, I’m good with moving forward as-is. Thanks!

Contributor Author

There are no stale entries in the stack. When an entry is removed out of order, it is replaced by a None, so there can be holes in the stack. Since the id in the ContextGuard is the position in the stack, we can't resize it until the topmost element is removed.
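
For readers of this thread, a simplified model of that behavior (illustrative names only, not the actual opentelemetry internals):

```rust
// A placeholder standing in for the real Context type in this sketch.
struct Context(&'static str);

struct ContextStack {
    // Out-of-order pops leave `None` "holes" instead of shifting entries,
    // so the position handed out to each guard stays valid.
    stack: Vec<Option<Context>>,
}

impl ContextStack {
    fn push(&mut self, cx: Context) -> usize {
        self.stack.push(Some(cx));
        self.stack.len() - 1 // this position acts as the guard's id
    }

    fn pop(&mut self, id: usize) {
        if id + 1 == self.stack.len() {
            // Topmost entry: remove it, then shrink past any trailing holes.
            self.stack.pop();
            while matches!(self.stack.last(), Some(None)) {
                self.stack.pop();
            }
        } else if let Some(slot) = self.stack.get_mut(id) {
            // Out-of-order pop: leave a hole; the stack can't shrink yet.
            *slot = None;
        }
    }

    fn current(&self) -> Option<&Context> {
        // The current context is the last entered-and-not-exited one.
        self.stack.iter().rev().flatten().next()
    }
}

fn main() {
    let mut stack = ContextStack { stack: Vec::new() };
    let a = stack.push(Context("a"));
    let b = stack.push(Context("b"));
    stack.pop(a); // out of order: leaves a hole at position 0
    assert_eq!(stack.current().map(|cx| cx.0), Some("b"));
    stack.pop(b); // pops "b" and cleans up the trailing hole
    assert!(stack.current().is_none());
}
```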


@mladedav mladedav left a comment

Just small documentation nits but otherwise this looks good to me.

Comment on lines 399 to 400
/// [`ContextGuard`] instances that are constructed using it can't be shared with
/// other threads.

Nit, I think "can't be shared with other threads" written like this usually means !Sync while here the guards are just !Send.

I think changing it to "can't be moved to other threads" would be clearer.
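
For context (a hypothetical sketch, not the actual ContextGuard definition), the difference between the two marker traits: "can't be shared" reads as !Sync, while the guards are only !Send, i.e. they may be referenced from other threads but not moved to them.

```rust
use std::marker::PhantomData;
use std::sync::MutexGuard;

// Neither Send nor Sync: can't be moved to or shared with other threads.
struct NeitherSendNorSync {
    _marker: PhantomData<*const ()>,
}

// !Send but still Sync: references may be shared with other threads, but the
// value itself can't be moved to another thread (std's MutexGuard is !Send).
struct NotSendButSync {
    _marker: PhantomData<MutexGuard<'static, ()>>,
}

// Compile-time check documenting the claim above.
fn assert_sync<T: Sync>() {}

fn main() {
    assert_sync::<NotSendButSync>();
    // assert_sync::<NeitherSendNorSync>(); // would fail to compile
}
```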

Comment on lines 398 to 399
/// The stack relies on the fact that it is thread local and that the
/// [`ContextGuard`] instances that are constructed using it can't be shared with

This talks about guards constructed using the stack, but they're just stored there and not constructed using it? Or what does the "it" here refer to?

Contributor Author

Yes, that is not clear. This refers to the id in the guards coming from the use of the stack. I'll rephrase it.

Member

@lalitb lalitb left a comment

LGTM. It would also be helpful to have a design doc for the current Context implementation here: https://github.com/open-telemetry/opentelemetry-rust/tree/main/docs/design. Something to consider for a separate PR.

@bantonsson bantonsson force-pushed the ban/explore-context-stack branch from 20ecab7 to 08ec6e1 Compare March 4, 2025 08:10
@cijothomas cijothomas merged commit baf4bfd into open-telemetry:main Mar 5, 2025
23 checks passed