Add Distributed Tracing semantics #1066

maxgolov · 2022-10-31T23:57:10Z

Add Distributed Tracing semantics to the following APIs:

ISemanticContext - allows to set (optional) Distributed Tracing context on all events emitted by ILogger
ability to use env_dt aliases to customize TraceContext of individual events running in a given scope. That allows manual nesting of a child span to a parent span.
semantic context projection for C++/CL/CX apps on Windows
add test that emits two events with DT semantics

PR adds "common" fields:

env_dt_traceId
env_dt_spanId
env_dt_traceFlags
Part B parentId

And the set of corresponding high-level API setters at context-level:

SetTraceId
SetSpanId
SetTraceFlags
SetParentId

which are styled in the same way as other common context setters are presently styled, in uniform fashion.

How to use it:

events land in Aria-Kusto (Aria schema) or Custom Kusto (common schema), and could be subsequently routed to Geneva via 1DS-Interchange (Geneva schema).
events landing in Geneva have the required ext.dt extension semantics, just enough for the Trace Explorer view to work.. Less duration semantics: since 1DS in principle does not require client events to have "duration", start / end / scope. Events are timepoints that carry timestamp and name. How exactly to remap these, i.e. what would/could/should be considered duration - is out-of-scope of this PR. A separate routing example would be provided elsewhere, out-of-scope for Client Telemetry repo.

How this could be improved further:

There's no helper for formatting individual W3C TraceContext fields here. Users may use whatever standard formatter of their choice, e.g. taking a dependency on OpenTelemetry API for well-formed context of trace parent header values. In our org scenario we don't need it - since we are utilizing existing .NET APIs that already give us the access to values in the proper format. So we're only calling into 1DS C++ SDK via projection to set and send the well-formed values we need. We do not require manual formatting of these values in C++ land at all.
Should someone need to learn how to do this formatting in C++ - please refer to the OpenTelemetry C++ code here, that illustrates how to format TraceId and SpanId to hexadecimal string values.

Example showing how events look when traced from client to service in one Kusto table:

APITest verifies the basic functionality. More complex usage examples are not presently a priority for our team, since in our case we use the higher-level .NET StartActivity API while interacting with remote service instrumented with OpenTelemetry, and supplement the existing client telemetry events with correlation identifiers that follow Common Schema naming, specifically Common Schema -to- Geneva mapping.

Separate additional work could be done later - to add Span and Log semantics, allowing to use 1DS SDK as a lower-level transport channel / exporter for OpenTelemetry SDK. Presently this additional work is out-of-scope.

- ISemanticContext - for all events emitted by ILogger - ability to use env_dt aliases on individual events - semantic context projection for C++/CL/CX apps on Windows - add test that emits two events with DT semantics

fibann · 2022-11-03T12:07:22Z

lib/include/public/ISemanticContext.hpp

+        /// Set the W3C TraceContext TraceFlags information of telemetry event.
+        /// </summary>
+        /// <param name="TraceFlags">TraceContext TraceFlags</param>
+        virtual void SetTraceFlags(uint8_t traceFlags)


nit: the current spec only supports one flag (sampled) so I wonder if we should rather:

add a SetSampled(bool) method, or

change this to take a bitfield, or

provide an enum or a TRACEFLAGS_SAMPLED constant to use here instead of '1'

(the last is probably best since it doesn't constrain the caller)

W3C spec and Common Schema support traceFlags:

An integer representation of the W3C TraceContext trace-flags.

Yes I am saying that we could have

struct TraceFlags { unsigned char sampled: 1, unsigned char reserved: 7 }; ... virtual void SetTraceFlags(TraceFlags traceFlags) ... TraceFlags flags; flags.sampled = 1; context.SetTraceFlags(flags);

or rather

enum TraceFlags { Sampled: 1, }; ... context.SetTraceFlags(TraceFlags::Sampled);

rather than passing a raw uint8_t.

We could later on add any additional method overloads we want. Indeed, we could add yet another method what you're proposing. I would rather go with an additional method that just does SetTraceContext. My goal was to expose the bare min set of flags, as per spec:

in Common Schema spec this is int

in W3C TraceContext spec this described as byte / uint8_t

Thus, I followed the W3C TraceContext guidelines in this regard. You can see that other implementations, e.g. Elastic Node Traceparent package - did not elaborate on the meaning of the flags field. It's just flags, number eventually encoded as hex. No more no less than what the 1DS (being the transport layer) is expecting to get.

We can add more decoration later. It's add-ons, not the necessity. I'd like to avoid the semantics that could possibly be later on borrowed from elsewhere, e.g. these semantics all exists in finer detail in OpenTelemetry API. The goal of this PR is to cover the bare min set of transport level detail, in order to enable the POC work that we're doing.. Incremental changes could come later.

Ok sounds good not to over-specify.

fibann · 2022-11-03T12:37:22Z

lib/include/public/CommonFields.h

+#define COMMONFIELDS_DT_TRACEID                              "env_dt_traceId"
+#define COMMONFIELDS_DT_SPANID                               "env_dt_spanId"
+#define COMMONFIELDS_DT_TRACEFLAGS                           "env_dt_traceFlags"
+#define COMMONFIELDS_DT_PARENTID                             "parentId"


My understanding is that parentId is meant to be used in events with Span semantics. Here we are adding it to every single event instead, which looks a bit weird. Are we doing this because supporting explicit Span semantics is more complicated? Would it cause problems if we decide to support explicit Span semantics later?

There are two separate logical destinations here:

first three element: traceId, spanId, traceFlags - is our current TraceContext, what we would've sent in traceparent header. That's what spec is describing.

the last one: parentId - is what we receive as our parent, describes current-to-parent relationship, i.e. what initiated us to run. Our span's parent. Developer focusing on client telemetry debugging would see this on one event. This is not part of traceparent sent to the remote end, but part of client telemetry that the our code emits. That field allows us to stitch together the caller (our parent) and our current span into tree view.. Without this information attached on event - the Trace View explorer UX won't be able to stitch the two operations on time ladder chart..

So:

spanid -- is our current span id. Part of W3C TraceContext spec.

parentId -- is our parent, our remote span id.

parentId (prev) -> spanId (current) -> then remote service would generate its own $spanId(of service) (next)

How you would use it:

Obtain the ILogger associated with logging some service information, i.e.
auto apiLogger = GetLogger("CatalogServiceAPILogger")
This is similar to how ILogger gets allocated for a component in .NET.

Then you SetContext(...) on that logger's semantic context. All events coming from that logger are associated with the service interaction. This is all that's necessary for the Distributed Trace Explorer view to show your events coming from that logger on a time ladder chart.

When your client starts interacting with catalog service via REST API call, you'd initiate your transaction with some initial seed for the traceparent header: your traceId, spanId, traceFlags - the ones you set on context... And you pass it to remote service.

Your client emits events using apiLogger->LogEvent -- all of these events get indexed by correlation platform and contain the necessary semantics to render them in a trace view.

If you need to re-initialize the session, i.e. user logged out / logged in, you'd have the new transaction - traceId, and you'd update the context on that logger. All further events now have a new different traceId.

Here we are adding it to every single event instead

Not every single event, not at LogManager level. Only for events on specific component logger, that are relevant to specific service interaction. However, if a client controls the initial seed of traceId, then the same traceId could be stamped on different logger contexts, for example:

auto loggerA = GetLogger("serviceA-REST-Logger")

auto loggerB = GetLogger("serviceB-REST-Logger")

auto loggerC = GetLogger("serviceC-REST-Logger").

Then you use SetContext at logger-level to stamp the same traceId, and their client-side interaction events would show up in one distributed tracing call tree / time ladder chart. Correspondingly, the far end, e.g. C# service - always stamps the same set of fields on all of Log and Span events coming from C# land. That's how it works, and it does stamp it on most (all?) service events too.

Second example in APITest actually shows how you can selectively append the context directly on just one event without using SetContext.. So both ways are possible, logger context - all events for specific named logger (this would be the preferred model). And local (event property) example - stamping on just one event.

Discussed more offline. Having a parentId for an event is a valid model even if the system doesn't follow the span model (see also comment in TraceContext standard). Supporting explicit Span semantics is sort of overlapping with what OpenTelemetry does and out of scope for now.

lalitb · 2022-11-14T07:56:06Z

Most of the attributes/fields defined in the ISemanticContext interface are not going to change after bootstrapping of SDK. Does Distributed Tracing attributes fit in this category - as trace_id/ span_id could change based on the context where Logger::LogEvent() is called.

Should we instead have an overloaded Logger::LogEvent method which accepts trace context. Something similar to how it is done in OpenTelemetry logging api - https://github.com/open-telemetry/opentelemetry-cpp/blob/389b84f4303f728c97dcf8194f76277488e18cf4/api/include/opentelemetry/logs/logger.h#LL59C3-L59C3

lalitb · 2023-01-06T18:53:55Z

@maxgolov - Do you plan to address the concerns raised in this PR, and make it ready for merge. Thanks.

maxgolov · 2023-02-14T04:33:50Z

@lalitb
I'd like to address some comments from reviewers, i.e.

adding a method that allows to set both TraceId+SpanId at once
possibly exposing a method like this to higher-level apps (I need it to work via C API too)

Apologies for keeping it open for so long. It's coming soon.

Add Distributed Tracing semantics to the following APIs:

277bc85

- ISemanticContext - for all events emitted by ILogger - ability to use env_dt aliases on individual events - semantic context projection for C++/CL/CX apps on Windows - add test that emits two events with DT semantics

maxgolov requested review from kindbe, mkoscumb, reyang, bliptec, lalitb and sid-dahiya as code owners October 31, 2022 23:57

Fix typo

4c0df71

maxgolov requested review from fibann and stephenatwork November 1, 2022 00:04

fibann reviewed Nov 3, 2022

View reviewed changes

fibann approved these changes Nov 4, 2022

View reviewed changes

Merge branch 'main' into user/maxgolov/1ds_distributed_tracing

874d5d3

maxgolov self-assigned this Feb 14, 2023

maxgolov added the distributed tracing label Feb 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Distributed Tracing semantics #1066

Add Distributed Tracing semantics #1066

maxgolov commented Oct 31, 2022 •

edited

Loading

fibann Nov 3, 2022

maxgolov Nov 3, 2022

fibann Nov 3, 2022

maxgolov Nov 3, 2022 •

edited

Loading

fibann Nov 4, 2022

fibann Nov 3, 2022

maxgolov Nov 3, 2022 •

edited

Loading

maxgolov Nov 3, 2022

maxgolov Nov 3, 2022 •

edited

Loading

fibann Nov 4, 2022

lalitb commented Nov 14, 2022

lalitb commented Jan 6, 2023

maxgolov commented Feb 14, 2023

Add Distributed Tracing semantics #1066

Are you sure you want to change the base?

Add Distributed Tracing semantics #1066

Conversation

maxgolov commented Oct 31, 2022 • edited Loading

fibann Nov 3, 2022

Choose a reason for hiding this comment

maxgolov Nov 3, 2022

Choose a reason for hiding this comment

fibann Nov 3, 2022

Choose a reason for hiding this comment

maxgolov Nov 3, 2022 • edited Loading

Choose a reason for hiding this comment

fibann Nov 4, 2022

Choose a reason for hiding this comment

fibann Nov 3, 2022

Choose a reason for hiding this comment

maxgolov Nov 3, 2022 • edited Loading

Choose a reason for hiding this comment

maxgolov Nov 3, 2022

Choose a reason for hiding this comment

maxgolov Nov 3, 2022 • edited Loading

Choose a reason for hiding this comment

fibann Nov 4, 2022

Choose a reason for hiding this comment

lalitb commented Nov 14, 2022

lalitb commented Jan 6, 2023

maxgolov commented Feb 14, 2023

maxgolov commented Oct 31, 2022 •

edited

Loading

maxgolov Nov 3, 2022 •

edited

Loading

maxgolov Nov 3, 2022 •

edited

Loading

maxgolov Nov 3, 2022 •

edited

Loading