Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Distributed Tracing semantics #1066

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

maxgolov
Copy link
Contributor

@maxgolov maxgolov commented Oct 31, 2022

Add Distributed Tracing semantics to the following APIs:

  • ISemanticContext - allows to set (optional) Distributed Tracing context on all events emitted by ILogger
  • ability to use env_dt aliases to customize TraceContext of individual events running in a given scope. That allows manual nesting of a child span to a parent span.
  • semantic context projection for C++/CL/CX apps on Windows
  • add test that emits two events with DT semantics

PR adds "common" fields:

  • env_dt_traceId
  • env_dt_spanId
  • env_dt_traceFlags
  • Part B parentId

And the set of corresponding high-level API setters at context-level:

  • SetTraceId
  • SetSpanId
  • SetTraceFlags
  • SetParentId

which are styled in the same way as other common context setters are presently styled, in uniform fashion.

How to use it:

  • events land in Aria-Kusto (Aria schema) or Custom Kusto (common schema), and could be subsequently routed to Geneva via 1DS-Interchange (Geneva schema).
  • events landing in Geneva have the required ext.dt extension semantics, just enough for the Trace Explorer view to work.. Less duration semantics: since 1DS in principle does not require client events to have "duration", start / end / scope. Events are timepoints that carry timestamp and name. How exactly to remap these, i.e. what would/could/should be considered duration - is out-of-scope of this PR. A separate routing example would be provided elsewhere, out-of-scope for Client Telemetry repo.

How this could be improved further:

  • There's no helper for formatting individual W3C TraceContext fields here. Users may use whatever standard formatter of their choice, e.g. taking a dependency on OpenTelemetry API for well-formed context of trace parent header values. In our org scenario we don't need it - since we are utilizing existing .NET APIs that already give us the access to values in the proper format. So we're only calling into 1DS C++ SDK via projection to set and send the well-formed values we need. We do not require manual formatting of these values in C++ land at all.
  • Should someone need to learn how to do this formatting in C++ - please refer to the OpenTelemetry C++ code here, that illustrates how to format TraceId and SpanId to hexadecimal string values.

Example showing how events look when traced from client to service in one Kusto table:
image

APITest verifies the basic functionality. More complex usage examples are not presently a priority for our team, since in our case we use the higher-level .NET StartActivity API while interacting with remote service instrumented with OpenTelemetry, and supplement the existing client telemetry events with correlation identifiers that follow Common Schema naming, specifically Common Schema -to- Geneva mapping.

Separate additional work could be done later - to add Span and Log semantics, allowing to use 1DS SDK as a lower-level transport channel / exporter for OpenTelemetry SDK. Presently this additional work is out-of-scope.

- ISemanticContext - for all events emitted by ILogger
- ability to use env_dt aliases on individual events
- semantic context projection for C++/CL/CX apps on Windows
- add test that emits two events with DT semantics
/// Set the W3C TraceContext TraceFlags information of telemetry event.
/// </summary>
/// <param name="TraceFlags">TraceContext TraceFlags</param>
virtual void SetTraceFlags(uint8_t traceFlags)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the current spec only supports one flag (sampled) so I wonder if we should rather:

  • add a SetSampled(bool) method, or
  • change this to take a bitfield, or
  • provide an enum or a TRACEFLAGS_SAMPLED constant to use here instead of '1'

(the last is probably best since it doesn't constrain the caller)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

W3C spec and Common Schema support traceFlags:

image

An integer representation of the W3C TraceContext trace-flags.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I am saying that we could have

struct TraceFlags {
  unsigned char sampled: 1,
  unsigned char reserved: 7
};
...
virtual void SetTraceFlags(TraceFlags traceFlags)
...
TraceFlags flags;
flags.sampled = 1;
context.SetTraceFlags(flags);

or rather

enum TraceFlags {
  Sampled: 1,
};
...
context.SetTraceFlags(TraceFlags::Sampled);

rather than passing a raw uint8_t.

Copy link
Contributor Author

@maxgolov maxgolov Nov 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could later on add any additional method overloads we want. Indeed, we could add yet another method what you're proposing. I would rather go with an additional method that just does SetTraceContext. My goal was to expose the bare min set of flags, as per spec:

  • in Common Schema spec this is int
  • in W3C TraceContext spec this described as byte / uint8_t

Thus, I followed the W3C TraceContext guidelines in this regard. You can see that other implementations, e.g. Elastic Node Traceparent package - did not elaborate on the meaning of the flags field. It's just flags, number eventually encoded as hex. No more no less than what the 1DS (being the transport layer) is expecting to get.

We can add more decoration later. It's add-ons, not the necessity. I'd like to avoid the semantics that could possibly be later on borrowed from elsewhere, e.g. these semantics all exists in finer detail in OpenTelemetry API. The goal of this PR is to cover the bare min set of transport level detail, in order to enable the POC work that we're doing.. Incremental changes could come later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sounds good not to over-specify.

#define COMMONFIELDS_DT_TRACEID "env_dt_traceId"
#define COMMONFIELDS_DT_SPANID "env_dt_spanId"
#define COMMONFIELDS_DT_TRACEFLAGS "env_dt_traceFlags"
#define COMMONFIELDS_DT_PARENTID "parentId"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that parentId is meant to be used in events with Span semantics. Here we are adding it to every single event instead, which looks a bit weird. Are we doing this because supporting explicit Span semantics is more complicated? Would it cause problems if we decide to support explicit Span semantics later?

Copy link
Contributor Author

@maxgolov maxgolov Nov 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two separate logical destinations here:

  • first three element: traceId, spanId, traceFlags - is our current TraceContext, what we would've sent in traceparent header. That's what spec is describing.

  • the last one: parentId - is what we receive as our parent, describes current-to-parent relationship, i.e. what initiated us to run. Our span's parent. Developer focusing on client telemetry debugging would see this on one event. This is not part of traceparent sent to the remote end, but part of client telemetry that the our code emits. That field allows us to stitch together the caller (our parent) and our current span into tree view.. Without this information attached on event - the Trace View explorer UX won't be able to stitch the two operations on time ladder chart..

So:

  • spanid -- is our current span id. Part of W3C TraceContext spec.
  • parentId -- is our parent, our remote span id.

parentId (prev) -> spanId (current) -> then remote service would generate its own $spanId(of service) (next)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How you would use it:

  • Obtain the ILogger associated with logging some service information, i.e.
    auto apiLogger = GetLogger("CatalogServiceAPILogger")
    This is similar to how ILogger gets allocated for a component in .NET.

  • Then you SetContext(...) on that logger's semantic context. All events coming from that logger are associated with the service interaction. This is all that's necessary for the Distributed Trace Explorer view to show your events coming from that logger on a time ladder chart.

  • When your client starts interacting with catalog service via REST API call, you'd initiate your transaction with some initial seed for the traceparent header: your traceId, spanId, traceFlags - the ones you set on context... And you pass it to remote service.

  • Your client emits events using apiLogger->LogEvent -- all of these events get indexed by correlation platform and contain the necessary semantics to render them in a trace view.

  • If you need to re-initialize the session, i.e. user logged out / logged in, you'd have the new transaction - traceId, and you'd update the context on that logger. All further events now have a new different traceId.

Copy link
Contributor Author

@maxgolov maxgolov Nov 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are adding it to every single event instead

Not every single event, not at LogManager level. Only for events on specific component logger, that are relevant to specific service interaction. However, if a client controls the initial seed of traceId, then the same traceId could be stamped on different logger contexts, for example:

  • auto loggerA = GetLogger("serviceA-REST-Logger")
  • auto loggerB = GetLogger("serviceB-REST-Logger")
  • auto loggerC = GetLogger("serviceC-REST-Logger").

Then you use SetContext at logger-level to stamp the same traceId, and their client-side interaction events would show up in one distributed tracing call tree / time ladder chart. Correspondingly, the far end, e.g. C# service - always stamps the same set of fields on all of Log and Span events coming from C# land. That's how it works, and it does stamp it on most (all?) service events too.

Second example in APITest actually shows how you can selectively append the context directly on just one event without using SetContext.. So both ways are possible, logger context - all events for specific named logger (this would be the preferred model). And local (event property) example - stamping on just one event.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed more offline. Having a parentId for an event is a valid model even if the system doesn't follow the span model (see also comment in TraceContext standard). Supporting explicit Span semantics is sort of overlapping with what OpenTelemetry does and out of scope for now.

@lalitb
Copy link
Contributor

lalitb commented Nov 14, 2022

Most of the attributes/fields defined in the ISemanticContext interface are not going to change after bootstrapping of SDK. Does Distributed Tracing attributes fit in this category - as trace_id/ span_id could change based on the context where Logger::LogEvent() is called.

Should we instead have an overloaded Logger::LogEvent method which accepts trace context. Something similar to how it is done in OpenTelemetry logging api - https://github.com/open-telemetry/opentelemetry-cpp/blob/389b84f4303f728c97dcf8194f76277488e18cf4/api/include/opentelemetry/logs/logger.h#LL59C3-L59C3

@lalitb
Copy link
Contributor

lalitb commented Jan 6, 2023

@maxgolov - Do you plan to address the concerns raised in this PR, and make it ready for merge. Thanks.

@maxgolov
Copy link
Contributor Author

@lalitb
I'd like to address some comments from reviewers, i.e.

  • adding a method that allows to set both TraceId+SpanId at once
  • possibly exposing a method like this to higher-level apps (I need it to work via C API too)

Apologies for keeping it open for so long. It's coming soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants