Fix broken Otel propagation in dspy ParallelExecutor #8505
Conversation
Pull Request Overview
This PR ensures OpenTelemetry context is propagated to threads in `ParallelExecutor`, fixes the MLflow trace hierarchy, and adds a test to verify the behavior.
- Introduce a `_with_otel_context` decorator and apply it to the worker function in `ParallelExecutor` (see the sketch after this list)
- Add `test_otel_context_propagation` to cover OpenTelemetry context forwarding
- Update `pyproject.toml` to include OpenTelemetry dependencies
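As a rough illustration (not the exact code from this PR), a `_with_otel_context`-style decorator could capture the current OpenTelemetry context in the parent thread and re-attach it inside each worker. The imports and names below follow the public `opentelemetry-api` surface; everything else is an assumption about the implementation.

```python
# Minimal sketch of a _with_otel_context-style decorator (assumed implementation,
# not necessarily identical to the PR). It must be applied in the PARENT thread,
# right before the worker is handed to the thread pool, so that get_current()
# captures the caller's active span context.
import functools


def _with_otel_context(func):
    try:
        from opentelemetry import context as otel_context
    except ImportError:
        # OpenTelemetry is an optional dependency; without it, return the function unchanged.
        return func

    parent_ctx = otel_context.get_current()  # snapshot taken in the parent thread

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        token = otel_context.attach(parent_ctx)  # make the parent span visible in this thread
        try:
            return func(*args, **kwargs)
        finally:
            otel_context.detach(token)  # restore the thread's previous context

    return wrapper
```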
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `tests/utils/test_parallelizer.py` | Added OpenTelemetry context propagation test |
| `pyproject.toml` | Added `opentelemetry-api` and `opentelemetry-sdk` |
| `dspy/utils/parallelizer.py` | Implemented `_with_otel_context` and applied the decorator to the worker |
Comments suppressed due to low confidence (1)
tests/utils/test_parallelizer.py:64
- The test references `ParallelExecutor` but does not import it, causing a NameError. Add: `from dspy.utils.parallelizer import ParallelExecutor`.

```python
@pytest.mark.skipif(not importlib.util.find_spec("opentelemetry"), reason="OpenTelemetry not installed")
```
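With the suggested import applied, the top of the test might look roughly like this (a sketch; the test body and exact module layout are assumptions based on the files listed above):

```python
# Hypothetical sketch of the test setup after applying the review suggestion.
import importlib.util

import pytest

from dspy.utils.parallelizer import ParallelExecutor  # import flagged as missing by the review


@pytest.mark.skipif(not importlib.util.find_spec("opentelemetry"), reason="OpenTelemetry not installed")
def test_otel_context_propagation():
    ...  # assert that spans created inside worker threads attach to the parent span
```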
@TomeHirata I am a bit unsure about this one. DSPy, as an authoring framework, may not want to be aware of the downstream tracing library. In other words, I'd like to avoid having OpenTelemetry code inside DSPy's critical paths. As an analogy, PyTorch/TensorFlow development doesn't stay aware of Lightning/Keras despite the close integration. If we want to maintain the hierarchy of the tracing spans, maybe we should patch it on the MLflow side? The implementation is a bit tricky, though.
Btw, it also feels a bit strange to display all modules' traces under the same Parallel span, since that normally represents the time order, but these child spans are equivalent in this case.
I completely share the same feeling, and I first looked into addressing this on the callback side. However, the copying of the OTel context or contextvars needs to happen when the child threads are spawned (in `ParallelExecutor._execute_parallel`). Other libraries tackle this issue by copying the context when running multi-threading (https://github.com/langchain-ai/langchain/blob/73552883c370b72b109910dcb4a85bdea786af90/libs/core/langchain_core/runnables/config.py#L166), but this would break the DSPy settings, unfortunately. Of course, I'm open to any better strategies as long as we can track the tree structure correctly.
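For context, the approach referenced above copies the whole `contextvars` context into the worker thread. A minimal illustration (not DSPy or LangChain code) of that pattern, using only the standard library:

```python
# Illustration of the contextvars-copying approach used by some other libraries.
# Copying everything would also clone DSPy's per-thread settings, which is why
# this PR copies only the OpenTelemetry context instead.
import contextvars
from concurrent.futures import ThreadPoolExecutor


def submit_with_copied_context(executor: ThreadPoolExecutor, fn, *args, **kwargs):
    ctx = contextvars.copy_context()  # snapshot of ALL contextvars in the parent thread
    return executor.submit(ctx.run, fn, *args, **kwargs)  # run fn inside the copied context
```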
Parent-child relationship and absolute execution time are slightly different. It's correct to have the module executions directly under the parallelizer call, since they are invoked directly by the top level. The absolute execution time is stored as span metadata, and the timeline view can show that they were executed in parallel.
Yes, that part is good. I was actually talking about a different thing - in the screenshot in the PR description, there are two predict spans under […]. I need to think a bit more about the fix, definitely a challenging one.
It's true in a single-thread case, but not always in multi-threading. In the multi-threading case, the invocation stack can be stored as a tree structure of spans, but we need to check the timeline view to understand when these spans were executed.
Currently, the tree structure of MLflow tracing gets broken when `dspy.Parallel` is used. This is because DSPy doesn't copy the parent context when workers are executed in parallel; not copying the entire contextvars is intentional, in order to maintain separate contextvars across different threads. This PR manually propagates the OTel context to the child threads spawned by `ParallelExecutor`, fixing the tracing issue.
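To make the propagation point concrete, a hedged sketch of where this could hook in, reusing the `_with_otel_context` sketch above (the function name `_execute_parallel_sketch` is hypothetical and stands in for `ParallelExecutor._execute_parallel`):

```python
# Assumed wiring, for illustration only: the decorator is applied in the parent
# thread just before workers are submitted, so the OTel context captured at that
# moment is the one re-attached inside each child thread.
from concurrent.futures import ThreadPoolExecutor


def _execute_parallel_sketch(worker, items, num_threads):
    wrapped = _with_otel_context(worker)  # must run in the parent thread to capture its context
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(wrapped, items))
```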