This RFC aims to describe how to add an OpenTelemetry traces source to Vector, and also addresses the Vector internals adjustments required for future extension to other trace types.
The `datadog_agent` source supports receiving traces from the Datadog `trace-agent`, and the `datadog_traces` sink supports emitting traces to Datadog. OpenTelemetry traces are already supported by Datadog:

- Either with the Datadog exporter using the OpenTelemetry collector (without the `trace-agent`)
- Or with the `trace-agent` configured to receive OpenTelemetry traces (both gRPC and HTTP transport layers are supported)
As trace processing inside Vector is pretty new, documenting confirmed and credible near-future use cases will help ensure changes are implemented in a way that is genuinely useful to potential users. This also helps build something flexible enough to accommodate future needs.

One identified scenario is to demux a trace flow based on conditions evaluated against any metadata of a single trace, a group of traces, or individual spans. From a config perspective this would be expected to work with the following configuration:
[...]

```yaml
sources:
  otlp:
    type: opentelemetry
    address: "[::]:8081"
    mode: grpc
transforms:
  set_key:
    type: remap
    inputs:
      - otlp.traces # Would exclusively emit traces
    source: |
      if !exists(.tags.user_id) {
        return
      }
      key = get_enrichment_table_record!("api_keys", { "user": .tags.user_id })
      set_metadata_field("datadog_api_key", key)
sinks:
  dd_trace:
    type: datadog_traces
    default_api_key: 12345678abcdef
    inputs:
      - set_key
```
This demux/conditional action can be seen as an extension of what currently exists in Vector. Other kinds of conditional actions, like using the `filter` transform to discard traces based on certain metadata, are very similar, as they also involve evaluating a VRL condition on traces. The key problem here is to expose trace and span fields in a way that users can still manipulate them easily.
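As an illustration, such a filter could look like the following sketch (the `.tags.env` field is hypothetical; the exact trace field layout is precisely what this RFC has to settle):

```yaml
transforms:
  drop_staging:
    type: filter
    inputs:
      - otlp.traces
    condition: .tags.env != "staging" # hypothetical trace-wide tag
```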
This, however, raises the question of the granularity of a single event; for instance, multiple traces can be bundled into a single payload in both the OpenTelemetry and Datadog wire formats. Enabling unambiguous processing argues for a clear constraint that should be enforced by all future trace sources: a single Vector event shall not hold data relative to more than one trace.
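Enforcing this constraint in a source could look like the following sketch, assuming a simplified span type (the names are illustrative, not the actual Vector internals):

```rust
use std::collections::BTreeMap;

// Illustrative span type; the real Vector event types live in
// lib/vector-core/src/event/ and are much richer than this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    trace_id: u64,
    span_id: u64,
    name: String,
}

// Enforce the constraint: one Vector event per trace ID, grouping all
// spans of an incoming payload by the trace they belong to.
fn split_payload_by_trace(spans: Vec<Span>) -> BTreeMap<u64, Vec<Span>> {
    let mut events: BTreeMap<u64, Vec<Span>> = BTreeMap::new();
    for span in spans {
        events.entry(span.trace_id).or_default().push(span);
    }
    events
}

fn main() {
    let payload = vec![
        Span { trace_id: 1, span_id: 10, name: "db.query".into() },
        Span { trace_id: 2, span_id: 20, name: "http.request".into() },
        Span { trace_id: 1, span_id: 11, name: "cache.get".into() },
    ];
    let events = split_payload_by_trace(payload);
    // A payload holding two distinct trace IDs yields two Vector events.
    assert_eq!(events.len(), 2);
    assert_eq!(events[&1].len(), 2);
}
```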
A completely different use case is trace sampling, which covers two major variations:

- Simple sampling: either cap/pace the trace flow at a given rate or sample 1 trace per 10/100/1000/etc. traces; this is already available thanks to the `sample` and `throttle` transforms
- Outlier isolation: keeping some traces based on advanced criteria, like execution time above p99; this would require comparison against histograms/sketches
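For the simple-sampling variation, something along these lines should already work with the existing transforms (assuming the `rate` semantics of the current `sample` transform, i.e. forward 1 event out of `rate`):

```yaml
transforms:
  sample_traces:
    type: sample
    inputs:
      - otlp.traces
    rate: 100 # keep 1 trace out of 100
```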
Another valuable identified use case is the ability to provide seamless conversion between all kinds of Vector-supported traces; this means that the Vector internal trace representation shall be flexible enough to accommodate conversion to/from any trace format in sources and sinks that work with traces. Given the traction of the OpenTelemetry project, and the fact that its format comes with a variety of fields covering most use cases, it is a natural candidate to base this representation on.
N/A
- An `opentelemetry` source, with both HTTP and gRPC support, decoding traces only, but with provision for other datatypes, and emitting traces on a named output `traces`
- Support `opentelemetry` source to `datadog_traces` sink forwarding by dealing with:
  - Trace normalization to a single format inside Vector
  - Conversion to/from this format in all trace sources/sinks
- APM stats computation logic, with an implementation fully functional for traces from both the `opentelemetry` and the `datadog_agent` sources.
N/A
- Avoid complex setups when ingesting traces; ultimately, pointing every tracing library directly at Vector should just work out-of-the-box with minimal config.
- Users would point an OpenTelemetry tracing library directly at a local Vector deployment
- Vector would be configured with a minimal config looking like:
```yaml
sources:
  otlp:
    type: opentelemetry
    address: "[::]:8081"
    mode: grpc
sinks:
  dd_trace:
    type: datadog_traces
    default_api_key: 12345678abcdef
    inputs:
      - otlp.traces # Would exclusively emit traces
```
And it should just work.
Based on the use cases previously detailed, we can extract the following top-level requirements for the implementation:
- A Vector trace event shall only contain data relative to one single trace, i.e. traces sources shall create one event for each individual trace ID and its associated spans and metadata.
- Use the OpenTelemetry trace format as the common denominator and base the Vector internal representation on it, to ensure:
  - A clear reference point for conversion between trace formats
  - That transforms avoid destructive manipulation, keeping trace objects fully functional even after heavy modifications while flowing through the topology
A new `opentelemetry` source with a named output `traces` (future extensions would cover `metrics` then `logs`):
- The gRPC variant would use Tonic to spawn a gRPC server (like the `vector` source in its v2 variation) and directly use the official gRPC service definitions. Only the traces gRPC service will be accepted at first, but it should be relatively easy to extend this to support the metrics and logs gRPC services.
- The HTTP variant would use a Warp server and attempt to decode protobuf payloads. As per the specification, payloads are encoded using protobuf, either in binary format or in JSON format (Protobuf schemas). All the expected behaviours regarding request/response codes and sequencing are clearly defined, as are the default URL paths (`/v1/traces` for traces; demuxing `/v1/metrics` and `/v1/logs` later should not be a problem).
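Assuming the transport is selected with a `mode: http` option mirroring the gRPC one (the exact option name is a sketch, not a settled API), the HTTP variant would be configured as:

```yaml
sources:
  otlp:
    type: opentelemetry
    address: "[::]:8080"
    mode: http # Warp server decoding OTLP/HTTP payloads posted to /v1/traces
```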
For cross-format operation, like forwarding the `opentelemetry` source `traces` output to the `datadog_traces` sink or the opposite (Datadog to OpenTelemetry), trace standardization is required so that, between sources and sinks, traces follow one single universal representation. There are two major possible approaches:
- Stick to a `LogEvent`-based representation and leverage the Vector event schema
- Move traces away from their current representation (as `LogEvent`) and build a new container based on a set of dedicated structs representing traces and spans, with common properties and generic key/value store(s) to allow a certain degree of flexibility
The second option would have to provide a way to store, at least, all fields from both OpenTelemetry and Datadog traces. If we consider the protobuf definitions for both Datadog and OpenTelemetry, it is clear that the OpenTelemetry format comes with extra structured fields that are not present in Datadog traces. However, the generic key/value container present in virtually all trace formats can be used to store data that does not have a dedicated field in a given format. As a basis for reflection, the Datadog and OpenTelemetry schemas are provided below; there are no hard semantic differences.
Datadog newer trace format (condensed):
```proto
message Span {
  string service = 1;
  string name = 2;
  string resource = 3;
  uint64 traceID = 4;
  uint64 spanID = 5;
  uint64 parentID = 6;
  int64 start = 7;
  int64 duration = 8;
  int32 error = 9;
  map<string, string> meta = 10;
  map<string, double> metrics = 11;
  string type = 12;
  map<string, bytes> meta_struct = 13;
}

message TraceChunk {
  // priority specifies sampling priority of the trace.
  int32 priority = 1;
  // origin specifies origin product ("lambda", "rum", etc.) of the trace.
  string origin = 2;
  // spans specifies list of containing spans.
  repeated Span spans = 3;
  // tags specifies tags common in all `spans`.
  map<string, string> tags = 4;
  // droppedTrace specifies whether the trace was dropped by samplers or not.
  bool droppedTrace = 5;
}

// TracerPayload represents a payload the trace agent receives from tracers.
message TracerPayload {
  // containerID specifies the ID of the container where the tracer is running on.
  string containerID = 1;
  // languageName specifies language of the tracer.
  string languageName = 2;
  // languageVersion specifies language version of the tracer.
  string languageVersion = 3;
  // tracerVersion specifies version of the tracer.
  string tracerVersion = 4;
  // runtimeID specifies V4 UUID representation of a tracer session.
  string runtimeID = 5;
  // chunks specifies list of containing trace chunks.
  repeated TraceChunk chunks = 6;
  // tags specifies tags common in all `chunks`.
  map<string, string> tags = 7;
  // env specifies `env` tag that set with the tracer.
  string env = 8;
  // hostname specifies hostname of where the tracer is running.
  string hostname = 9;
  // version specifies `version` tag that set with the tracer.
  string appVersion = 10;
}
```
Opentelemetry trace format (condensed):
```proto
message InstrumentationLibrarySpans {
  opentelemetry.proto.common.v1.InstrumentationLibrary instrumentation_library = 1;
  repeated Span spans = 2;
  string schema_url = 3;
}

message Span {
  bytes trace_id = 1;
  bytes span_id = 2;
  string trace_state = 3;
  bytes parent_span_id = 4;
  string name = 5;
  enum SpanKind {
    SPAN_KIND_UNSPECIFIED = 0;
    SPAN_KIND_INTERNAL = 1;
    SPAN_KIND_SERVER = 2;
    SPAN_KIND_CLIENT = 3;
    SPAN_KIND_PRODUCER = 4;
    SPAN_KIND_CONSUMER = 5;
  }
  SpanKind kind = 6;
  fixed64 start_time_unix_nano = 7;
  fixed64 end_time_unix_nano = 8;
  repeated opentelemetry.proto.common.v1.KeyValue attributes = 9;
  uint32 dropped_attributes_count = 10;
  message Event {
    fixed64 time_unix_nano = 1;
    string name = 2;
    repeated opentelemetry.proto.common.v1.KeyValue attributes = 3;
    uint32 dropped_attributes_count = 4;
  }
  repeated Event events = 11;
  uint32 dropped_events_count = 12;
  message Link {
    bytes trace_id = 1;
    bytes span_id = 2;
    string trace_state = 3;
    repeated opentelemetry.proto.common.v1.KeyValue attributes = 4;
    uint32 dropped_attributes_count = 5;
  }
  repeated Link links = 13;
  uint32 dropped_links_count = 14;
  Status status = 15;
}
```
The key construct in all trace formats is the span, and a trace is a set of spans. The OpenTelemetry span structure is rather verbose and comes with complex nested fields. The Datadog approach is either to ignore those (e.g. the links field is ignored) or to encode the complete field into a text representation (e.g. events are encoded using JSON) and include the resulting value in the tags (a.k.a. Meta) map.
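The events-to-Meta encoding described above can be sketched as follows; the `"events"` key is a placeholder, and the JSON is hand-rolled only to keep the example dependency-free (a real implementation would use a proper serializer):

```rust
use std::collections::BTreeMap;

// Condensed view of an OpenTelemetry span event (see the schema above).
struct SpanEvent {
    time_unix_nano: u64,
    name: String,
}

// Datadog spans have no dedicated field for events, so encode them as a
// JSON string stored under a conventional key of the span `meta` map.
fn events_into_meta(events: &[SpanEvent], meta: &mut BTreeMap<String, String>) {
    let encoded: Vec<String> = events
        .iter()
        .map(|e| format!(r#"{{"time_unix_nano":{},"name":"{}"}}"#, e.time_unix_nano, e.name))
        .collect();
    meta.insert("events".to_string(), format!("[{}]", encoded.join(",")));
}

fn main() {
    let mut meta = BTreeMap::new();
    events_into_meta(
        &[SpanEvent { time_unix_nano: 1, name: "boot".into() }],
        &mut meta,
    );
    assert_eq!(meta["events"], r#"[{"time_unix_nano":1,"name":"boot"}]"#);
}
```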
This makes the opposite conversion a bit complicated if we want it to be completely symmetrical, but there was already an attempt to allow Datadog traces ingestion in the OpenTelemetry collector. While this PR was closed unmerged, it provides a valuable example. In any case, the [otlp-and-other-formats][OpenTelemetry] document acknowledges that some OpenTelemetry constructs end up being stored as tags or annotations in other formats.
In any case, the OpenTelemetry to Datadog traces conversion is dictated by existing implementations in both the `trace-agent` and the Datadog exporter, as users will expect consistent behaviour from one solution to another. The same consideration applies to APM stats computation, as official implementations already provide a reference that defines what should be done to get the same result with Vector in the loop. The other way, from Datadog to OpenTelemetry, is less common as of today, but while implementing conversions we shall ensure that the following path is idempotent:
(Datadog Trace) -> (Vector internal format - based on Opentelemetry) -> (Datadog Trace)
There is no particular field or subset of metadata that would prevent idempotency in that case. This remains a strong requirement and shall apply to all third-party trace formats that will be converted to/from the upcoming Vector internal representation in similar scenarios.
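The idempotency requirement can be captured by a property-style test; the conversion below is a drastic simplification (a single field pair, with hypothetical type names) of what the real implementation would have to guarantee across all fields:

```rust
use std::collections::BTreeMap;

// Drastically simplified stand-ins for the real formats.
#[derive(Debug, Clone, PartialEq)]
struct DatadogSpan {
    trace_id: u64,
    meta: BTreeMap<String, String>,
}

#[derive(Debug, Clone, PartialEq)]
struct InternalSpan {
    trace_id: u64,
    attributes: BTreeMap<String, String>,
}

fn to_internal(s: &DatadogSpan) -> InternalSpan {
    InternalSpan { trace_id: s.trace_id, attributes: s.meta.clone() }
}

fn to_datadog(s: &InternalSpan) -> DatadogSpan {
    DatadogSpan { trace_id: s.trace_id, meta: s.attributes.clone() }
}

fn main() {
    let original = DatadogSpan {
        trace_id: 42,
        meta: BTreeMap::from([("env".to_string(), "prod".to_string())]),
    };
    // (Datadog Trace) -> (Vector internal) -> (Datadog Trace) must be lossless.
    assert_eq!(to_datadog(&to_internal(&original)), original);
}
```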
Note: The Rust OpenTelemetry implementation implements a conversion from OpenTelemetry traces to the Datadog `trace-agent` format. This is not the purpose of this RFC, and with the OpenTelemetry trace format being supported on both sides, working on better interoperability on that particular common ground would likely be a better option.
Conclusion: the implementation will stay around ./lib/vector-core/src/event/trace.rs; it will borrow most of the OpenTelemetry format to allow straightforward trace conversion to the newer Vector internal representation. Regarding the `datadog_agent` source and the `datadog_traces` sink, the conversion to/from this newer trace representation will follow existing logic and ensure that standard use cases (like introducing Vector between the `trace-agent` and the Datadog intake) do not significantly change the end-to-end behaviour. Some top-level information (like trace ID, trace-wide tags/metrics, the original format) is likely to be added to the internal trace representation for efficiency and convenience.
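A rough sketch of what such a dedicated container could look like, borrowing OpenTelemetry field names; this is illustrative only, and the final shape will be settled in trace.rs:

```rust
use std::collections::BTreeMap;

// Illustrative span struct, OpenTelemetry-flavoured.
#[derive(Debug, Clone)]
struct SpanData {
    span_id: Vec<u8>,
    parent_span_id: Vec<u8>,
    name: String,
    start_time_unix_nano: u64,
    end_time_unix_nano: u64,
    // Generic key/value store absorbing fields that have no dedicated
    // slot in a given third-party format.
    attributes: BTreeMap<String, String>,
}

// Top-level information (trace ID, trace-wide tags) duplicated here for
// efficient routing/filtering, per the conclusion above.
#[derive(Debug, Clone)]
struct TraceEvent {
    trace_id: Vec<u8>,
    tags: BTreeMap<String, String>,
    spans: Vec<SpanData>,
}

fn main() {
    let event = TraceEvent {
        trace_id: vec![0xde, 0xad],
        tags: BTreeMap::from([("env".to_string(), "prod".to_string())]),
        spans: vec![SpanData {
            span_id: vec![1],
            parent_span_id: vec![],
            name: "http.request".to_string(),
            start_time_unix_nano: 0,
            end_time_unix_nano: 1_000,
            attributes: BTreeMap::new(),
        }],
    };
    assert_eq!(event.spans.len(), 1);
    assert_eq!(event.tags["env"], "prod");
}
```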
Traces would not get a native `VrlTarget` representation anymore; there is a bigger discussion there that should probably be addressed separately. As an interim measure, a few fields may be exposed (at least the trace ID and trace-wide tags); the span list will not be exposed initially.
The APM stats computation can be seen as a generic way to compute statistics on a trace flow; the following key points have been discussed:

- While APM stats may be useful outside the Datadog context, as they are somewhat standard metrics that could theoretically be useful to any metrics backend, as of today it seems unlikely that this will ever happen. So third-party usage of APM stats won't be addressed until there is demand for it.
- APM stats are essentially a Datadog thing. If metrics should be extracted from traces at some point in the future, this would probably materialize as a `traces_to_metric` transform, but the exact scope and usefulness of such a transform remain to be determined; in any case this would be unrelated to APM stats.
- Considering the user experience, not exposing any APM stats consideration in the Vector config is a safe, conservative choice. That being said, APM stats coming out of Vector shall remain relevant in all circumstances. Based on the fact that the first identified use cases revolve around routing and filtering, the most convenient location to do APM stats computation is directly in the `datadog_traces` sink. The major issue is around sampling: statistically speaking, distribution metrics won't be impacted, but other metrics (like counters/gauges) will; note that if the sampling rate is known, it would still be possible to estimate the original value for those metrics. In any case, this has to be documented.
- Implement logic similar to what is done in the Datadog OTLP exporter; this would allow users to use multiple Datadog products with OpenTelemetry traces and get the same consistent behaviour in all circumstances. APM stats computation is hooked there in the Datadog exporter, but as this is Go code it relies on the Agent codebase to do the actual computation.
Conclusion: APM stats computation will follow what's done in the Datadog OTLP exporter, and the computation will happen against the outgoing trace stream directly in the `datadog_traces` sink. Incoming APM stats received by the `datadog_agent` source will then be ignored.
- OpenTelemetry is the de-facto standard for traces, so supporting it at some point is mandatory. Note that this consideration is wider than just traces, as metrics (and logs) are also addressed by the OpenTelemetry project.
Adopting an internal trace representation based on OpenTelemetry seems well suited for applications that involve remote submission and processing. However, other trace formats further from the OpenTelemetry format, like CTF, which can also be emitted while the traced application is running, may not fit very well into an OpenTelemetry-based representation.
N/A
- We could keep the Datadog trace-agent as an OTLP->Datadog traces converter and ingest Datadog traces from there
- We could keep the Datadog exporter as an OTLP->Datadog traces converter and ingest Datadog traces from there
- We could write a Vector exporter for the OpenTelemetry collector; note that this would likely leverage the Vector protocol, and this logic could be applied to metrics as well
N/A
- Implement traces normalisation/schema
- `opentelemetry` source, gRPC mode
- `opentelemetry` source, HTTP mode
- APM stats computation
- Transforms / complete VRL coverage of traces, later helpers to manipulate traces or isolate outliers
- OpenTelemetry traces sink
- Add metrics then logs to the `opentelemetry` source.