Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set parent spans sensibly when making requests #1073

Closed
wants to merge 3 commits into from
Closed

Set parent spans sensibly when making requests #1073

wants to merge 3 commits into from

Conversation

pkoenig10
Copy link
Member

@pkoenig10 pkoenig10 commented Apr 20, 2019

This is an attempt at a simpler implementation of #1069. I wrote this and then realized the other PR existed. But this is sufficiently different so I figured I'd put it up anyways to get feedback.

Before this PR

All acquire-limiter-* and execute-* spans have the callers span as their parentSpanId. This makes it very difficult to group spans for a request together when multiple requests are being made simultaneously on the same trace. As an example, here is a trace I was viewing recently:

Screen Shot 2019-04-19 at 3 06 13 PM

A couple other issues with the current implementation:

  • The acquire-limiter-run span is useless. It always completes immediately and is just extra noise in the trace logs
  • Having both enqueue and run spans for both the acquire-limiter and execute is a bit confusing. It is not immediately clear to most users what is meant by "execute" and what tasks are being enqueued and then run.

After this PR

This PR makes a couple changes:

  1. Create a top-level enqueue span around submitting the request. This is useful to provide a single span under which all request spans will exist. This is the only span whose parent is the caller's span.
  2. Remove the execute-* async span and instead nest the enqueueing of the request in the limiter-run span. This make the limiter-run span no longer useless since it contains the actual request.

The first commit in this PR just updates the tracing test to clearly show how spans are parented. This should make it easier to understand the behavior changes as a result of this change.

Possible downsides?

The top-level enqueue span will be very short and will not outlive the request. Making this span live as long as the request is a bit error-prone since the span would have to be started, left open, and then completed on another thread. Getting the spans to work out in this way is tricky.


assertThat(spans).containsExactly(
span(spans.get(0), traceId, spanIds.get(2), SpanType.LOCAL, "OkHttp: limiter-enqueue"),
span(spans.get(1), traceId, spanIds.get(2), SpanType.LOCAL, "OkHttp: limiter-run"),
Copy link
Member Author

@pkoenig10 pkoenig10 Apr 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also just remove this span entirely and have the dispatcher span be directly under the enqueue span. This would require adding a method to the AsyncTracer that only completes the async span:

public void completeSpan() {
    Trace originalTrace = Tracer.getAndClearTrace();
    Tracer.setTrace(deferredTrace);
    // Finish the enqueue span
    Tracer.fastCompleteSpan();
    Tracer.setTrace(originalTrace);
}

Copy link
Member Author

@pkoenig10 pkoenig10 Apr 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we could modify the AsyncTracer to allow for different operation names for the "execute" and "run" spans to improve the naming (something like "limiter" and "execute").

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to remove/deprecate AsyncTracer in favor of proper APIs to support what we want to measure. palantir/tracing-java#124

span(spans.get(1), traceId, spanIds.get(2), SpanType.LOCAL, "OkHttp: limiter-run"),
span(spans.get(2), traceId, parentSpanId, SpanType.LOCAL, "OkHttp: enqueue"),
span(spans.get(3), traceId, spanIds.get(4), SpanType.CLIENT_OUTGOING, "OkHttp: GET /{param}"),
span(spans.get(4), traceId, spanIds.get(1), SpanType.LOCAL, "OkHttp: dispatcher"));
Copy link
Contributor

@diogoholanda diogoholanda Apr 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems weird to me that this span is parented by the useless OkHttp: limiter-run. In my opinion, OkHttp: limiter-run shouldn't even exist. If it does in the meantime, it shouldn't be the parent span of anything.
I also like the approach on #1069 of just removing the Dispatcher span altogether and going straight to the CLIENT_OUTGOING span, which should be a sibling of the OkHttp: limiter-* spans

Copy link
Member Author

@pkoenig10 pkoenig10 Apr 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, OkHttp: limiter-run shouldn't even exist

I agree. Ideally there would just be a single span to track the time spent waiting for the limiter. But doing so requires a change to `AsyncTracer. See my comment at #1073 (comment).


I also like the approach on #1069 of just removing the Dispatcher span altogether

I don't disagree, but there needs to be some mechanism to carry the trace over to the dispatcher thread. This should probably happen before anything runs (ie not in an interceptor). I don't think it unreasonable to keep the dispatcher span since

  1. We instrument all other executors in this same way
  2. There is value in tracking the duration of tasks running in this executor. This helps us know if we are spending a lot of time in a dispatcher thread doing work outside the request span.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first point is that the second best thing for OkHttp: limiter-run not existing is OkHttp: limiter-run being a no-op and having no children. We can always remove it later once we fix AsyncTrancer.

ListenableFuture<Limiter.Listener> limiterListener = limiter.acquire();
request().tag(ConcurrencyLimiterListener.class).setLimiterListener(limiterListener);
Futures.addCallback(limiterListener, new FutureCallback<Limiter.Listener>() {
@Override
public void onSuccess(Limiter.Listener listener) {
tracer.withTrace(() -> null);
enqueueInternal(callback);
tracer.withTrace(() -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you shouldn't make this change (given my comment above on the test)

@@ -155,14 +156,22 @@ private static ResponseBody buffer(MediaType mediaType, byte[] bodyBytes) {

@Override
public void enqueue(Callback callback) {
AsyncTracer tracer = new AsyncTracer("OkHttp: acquire-limiter");
try (CloseableTracer ignored = CloseableTracer.startSpan("OkHttp: enqueue")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this lead to a very fast span (just enqueing) which is the parent of much longer spans (the actual call)? Ideally, the duration of a span would contain all its children. I would suggest either removing this, or having a span that actually lasts for the whole duration of the request (which I believe is something that #1069 does and was one of my original reasons to start discussing that change with @robert3005)

Copy link
Member Author

@pkoenig10 pkoenig10 Apr 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, the duration of a span would contain all its children

Ideally yes, but the more I think about it, the less important I think this is. Ideally a span's end time should:

  1. Be after the end time of it's last child
  2. Be before the start time of it's next sibling

However with async tasks, we can only pick one of these. I would argue that sibling spans occuring at the same time on the same trace is even more confusing than a parent that does not outlive it's child. Especially since that is exactly what is happening when work is done asynchronously - the parent finished submitting work before the work is done.


I would suggest either removing this

We need something here under which all other request spans can exist. Otherwise all these spans are parented by the caller's span which leads to the confusion I detail in the PR description.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I think it is important is because if a span has a very small duration, I assume I don't need to expand it to check its children. It seems like a fair assumption to make, otherwise I would always have to expand all spans, no matter how long they are. I actually feel somewhat strongly about trying to keep a span's duration expanding through all its children duration.

If you have reasons not to remove this span, could you make it cover the whole request, from limiter to the actual response coming back.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't disagree with you. This is actually the behavior I wanted when I started to make this changes. But doing this with the current API is very error-prone since it requires:

  1. Calling startSpan at the beginning of the request
  2. Then not closing the span in order to keep in on the stack (this is different than the behavior of AsyncTracer, which immediately discards this span)
  3. Calling fastCompleteSpan after the request is done, assuming that the span on the top of the stack is the one we started in step (1).

There are a couple things we can do to mitigate this:

Copy link
Member Author

@pkoenig10 pkoenig10 Apr 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this actually requires removing the dispatcher span, otherwise (3) will never be true since the dispatcher span will always be on the top of the stack. But then you need another mechanism to transfer the trace from the submitting thread to the dispatcher thread. And preferably one that captures as much of the dispatcher work as possible (ideally, you would capture all of it, which is what the wrapped executor does).

This actually doesn't work. Since palantir/tracing-java#97, DeferredTracer no longer keeps the span stack of the caller but only keeps the parentSpanId.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design a better API to to safely support these types of spans (I think that's what palantir/tracing-java#124 is attempting to do)

That's correct

Design a better API that verifies the span on the top of the stack is an expected span when closing

Tracking here: palantir/tracing-java#34

@iamdanfox
Copy link
Contributor

Closing out because I hope 4.36.0 will satisfy the outcome: https://github.com/palantir/conjure-java-runtime/releases/tag/4.36.0

@iamdanfox iamdanfox closed this Sep 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants