This RFC describes a change that would:
- Bring initial support for traces
- Ingest traces from the Datadog trace agent (a.k.a. the client side of the APM product)
- Relay ingested traces to Datadog
This RFC is part of the global effort to enable Vector to ingest & process traffic coming out of Datadog Agents. Vector internal tracing has its own RFC.
Official "Datadog Agent" bundles (rpm/deb/msi/container image) actually ship multiple binaries, collectively named
"agents". Each of these "agents" is tasked to collect some data. For example "the core agent" (often shortened to "the
agent", because it was the first of those agents to be released) is the one collecting metrics, logs and running checks.
There are other agents like the process-agent, the security-agent or the trace-agent. But all of those are part of the
official "Datadog Agent" distribution logic, and they all come out of the datadog-agent codebase. This RFC focuses
on the `trace-agent`, which is one of the several binaries shipped along with the other agents.
Traces are collected by this specific agent, which comes with a lot of dedicated configuration settings (usually under
the `apm_config` prefix), but it also shares some global options, like `site`, with other agents to select the Datadog
region where to send data. It exposes a local API that is used by tracing libraries to submit traces & profiling
data.
It has several communication channels to Datadog:
- Processed traces are relayed to `trace.<SITE>` (can be overridden by the `apm_config.apm_dd_url` config key)
- Additional profiling data are relayed to `intake.profile.<SITE>` (can be overridden by the `apm_config.profiling_dd_url` config key); they are not processed by the trace-agent and are relayed directly to Datadog
- Some debug logs are simply proxied to the log endpoint `http-intake.logs.<SITE>` (can be overridden by the `apm_config.debugger_dd_url` config key); it is fairly [recent] and unused as of October 2021.
- It computes some stats on incoming traces; this is an aggregation of statistics submitted by some tracing libraries and statistics computed by the `trace-agent`. Aggregated stats are sent back to Datadog, to the same host as processed traces (`trace.<SITE>`). Tracer-side stats have been supported since Agent 7.25. APM stats computed by the `trace-agent` itself are not strictly mandatory, but they are very useful.
- It emits metrics for observability purposes. Those metrics are sent using the dogstatsd protocol to the core agent, which usually runs on the same host (it could be in a different container if using the official Helm chart). The core agent then forwards all those metrics to Datadog with additional enrichment (hostname, tags).
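The default hosts and override keys listed above can be summarised in a small Rust sketch. The `ApmEndpoints` struct and its method names are hypothetical and exist only for illustration; only the host patterns and config key names come from the list above. Defaults are returned as bare host names, and an override is returned verbatim, since scheme and path handling are not described here.

```rust
/// Hypothetical, simplified view of the trace-agent settings discussed above.
/// Field names mirror the config keys; the struct itself is an illustration only.
#[derive(Debug, Default, Clone)]
pub struct ApmEndpoints {
    /// Global `site` option shared with the other agents.
    pub site: String,
    /// `apm_config.apm_dd_url` override, if set.
    pub apm_dd_url: Option<String>,
    /// `apm_config.profiling_dd_url` override, if set.
    pub profiling_dd_url: Option<String>,
    /// `apm_config.debugger_dd_url` override, if set.
    pub debugger_dd_url: Option<String>,
}

impl ApmEndpoints {
    /// Destination for processed traces and aggregated APM stats.
    pub fn traces(&self) -> String {
        self.apm_dd_url
            .clone()
            .unwrap_or_else(|| format!("trace.{}", self.site))
    }

    /// Destination for profiling data, which is proxied without processing.
    pub fn profiling(&self) -> String {
        self.profiling_dd_url
            .clone()
            .unwrap_or_else(|| format!("intake.profile.{}", self.site))
    }

    /// Destination for proxied debugger logs.
    pub fn debugger_logs(&self) -> String {
        self.debugger_dd_url
            .clone()
            .unwrap_or_else(|| format!("http-intake.logs.{}", self.site))
    }
}
```

Setting `apm_dd_url` to the address of a Vector instance is the mechanism this RFC relies on to divert traces away from the default `trace.<SITE>` host.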
Profiling and Tracing are enabled independently on traced applications. But they can be correlated once ingested at Datadog, mainly to refine a span with profiling data.
The trace-agent encodes data using protobuf; the .proto files are located in the datadog-agent repository. Trace-agent requests to the trace endpoint contain two major kinds of data:
- standard traces that consist of an aggregate of spans (cf.[1] & [2])
- it also sends "selected" spans (a.k.a. APM events) (cf. [3]). They are extracted by the trace-agent; once ingested by Datadog, those selected spans are used under the hood to better identify & contextualize traces, and they can be fully indexed as well (short description of APM events)
Ongoing work to support event schemas would make it possible to express constraints on an event's structure. In this case, it would allow formalizing a trace schema while keeping the underlying data as a standard Vector event. The trace sink would then expect events following this schema.
- Ingest traces from the trace-agent in the `datadog_agent` source
- Send traces to the Datadog trace endpoint through a new `datadog_trace` sink
- Basic operation on traces: filtering, routing
- Pave the way for OpenTelemetry traces
- Profiling data, but as the trace-agent only proxies profiling data, the same behaviour can be implemented rather quickly in Vector.
- Debugger logs can already be diverted to Vector (untested, but it should work as Vector supports Datadog logs and there is a config option to explicitly configure the debugger log destination)
- Metrics emitted by the trace-agent (they could theoretically be received by Vector by a statsd source, but the host used by the trace-agent is derived from the local config to programmatically discover the main agent, thus there is no existing knob to force the trace agent to send metrics to a custom dogstatsd host)
- Span extraction, filtering
- Other sources & sinks for traces than the `datadog_agent` source & `datadog_trace` sink
Vector does not support traces at the moment (a full JSON representation may be ingested as a log event), even though traces are a key part of observability. Therefore, users cannot use Vector for business-level use cases on trace data, like cost control and reduction, redacting PII, routing, and more.
- User will be able to ingest traces from the trace agent
  - Vector config would then consist of: `datadog_agent` source -> some filtering/enrichment transform -> `datadog_trace` sink
  - Datadog trace agent can be configured to send traces to any arbitrary endpoint using the `apm_config.apm_dd_url` config key
- This change is a pure addition to Vector, there will be no impact on existing Datadog trace agent features
To keep vector-core as generic as possible, the first implementation will decode Datadog traces as `LogEvent`; the
resulting event will be deeper than usual, but this should not be a problem. In order to distinguish traces from logs,
the `Event` enum will get a new `Trace` variant that will wrap `LogEvent`.
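As a rough sketch of the idea, the snippet below models an `Event` enum whose `Trace` variant wraps the same type as the `Log` variant. The `Value`, `LogEvent` and `Metric` types here are simplified stand-ins written for this example, not Vector's actual definitions, and the helper methods `is_trace` and `as_log_like` are hypothetical.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for the values a `LogEvent` can hold (not Vector's real type).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Bytes(String),
    Integer(i64),
    Array(Vec<Value>),
    Map(BTreeMap<String, Value>),
}

/// Simplified stand-in for Vector's `LogEvent`: a map of possibly nested values.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct LogEvent {
    pub fields: BTreeMap<String, Value>,
}

/// Placeholder metric type, present only so the enum has a non-log variant.
#[derive(Debug, Clone, PartialEq)]
pub struct Metric {
    pub name: String,
    pub value: f64,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Log(LogEvent),
    Metric(Metric),
    /// Proposed variant: same data model as a log, but tagged as a trace.
    Trace(LogEvent),
}

impl Event {
    /// Hypothetical helper: true only for the `Trace` variant.
    pub fn is_trace(&self) -> bool {
        matches!(self, Event::Trace(_))
    }

    /// Hypothetical helper: borrow the underlying `LogEvent` of a log or a trace.
    pub fn as_log_like(&self) -> Option<&LogEvent> {
        match self {
            Event::Log(log) | Event::Trace(log) => Some(log),
            Event::Metric(_) => None,
        }
    }
}
```

Because both variants carry the same payload type, code that only manipulates fields can stay shared, while sources, sinks and routing can still tell traces apart from logs by matching on the variant.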
Upcoming work on the ability to validate a `LogEvent` against a schema would provide a
nice way (performance questions aside) of ensuring that a `datadog_trace` sink receives a properly
structured `LogEvent`.
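Until schema validation exists, a sink would have to check the shape of incoming events by hand. The sketch below shows what such a check might look like. The field names `trace_id` and `spans`, the `Value` type and the `check_trace_shape` function are all assumptions made for this example; the actual layout of a decoded trace is not defined in this RFC.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for event values (not Vector's real type).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Integer(i64),
    Text(String),
    Array(Vec<Value>),
    Map(BTreeMap<String, Value>),
}

/// Reasons an event can fail the hand-written shape check.
#[derive(Debug, PartialEq)]
pub enum TraceShapeError {
    MissingField(&'static str),
    WrongType(&'static str),
    EmptySpans,
}

/// Checks the assumed minimal shape of a trace event: an integer `trace_id`
/// and a non-empty `spans` array whose items are all maps.
/// Returns the number of spans on success.
pub fn check_trace_shape(
    fields: &BTreeMap<String, Value>,
) -> Result<usize, TraceShapeError> {
    match fields.get("trace_id") {
        Some(Value::Integer(_)) => {}
        Some(_) => return Err(TraceShapeError::WrongType("trace_id")),
        None => return Err(TraceShapeError::MissingField("trace_id")),
    }

    let spans = match fields.get("spans") {
        Some(Value::Array(spans)) => spans,
        Some(_) => return Err(TraceShapeError::WrongType("spans")),
        None => return Err(TraceShapeError::MissingField("spans")),
    };

    if spans.is_empty() {
        return Err(TraceShapeError::EmptySpans);
    }
    if spans.iter().any(|span| !matches!(span, Value::Map(_))) {
        return Err(TraceShapeError::WrongType("spans"));
    }
    Ok(spans.len())
}
```

A schema would replace this kind of per-sink code with a single declarative description of the expected structure.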
Based on the aforementioned work, the following source & sink additions would have to be made:
- A `datadog_agent` addition that decodes incoming gzip'ed protobuf over HTTP to a `LogEvent`. The .proto files are located in the datadog-agent repository.
- A new `datadog_trace` sink that does the opposite conversion and sends the trace to Datadog, to the relevant region according to the sink config.
The `datadog_agent` source addition would materialize as a new filter (like the one dedicated to receiving
logs), ideally with the trace decoding logic located in its own source file
(./src/sources/datadog/traces.rs). The filter would be attached to the warp server depending on a new configuration flag. This
way the trace-related code would be isolated. The new configuration flags would be three booleans, one each for logs, metrics and
traces, enabling or disabling each data type. This way the user can multiplex all three data types over a single socket, or use a
socket per one or more data types, at the user's convenience.
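The three-boolean design can be pictured with the following sketch. The struct, the field names and the `enabled` helper are placeholders chosen for this example; the final option names are not settled by this RFC, and the actual attachment of filters to the warp server is left out.

```rust
/// Data types a `datadog_agent` source instance may accept.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataType {
    Logs,
    Metrics,
    Traces,
}

/// Hypothetical per-source flags, one boolean per data type.
#[derive(Debug, Clone, Copy)]
pub struct DatadogAgentFlags {
    pub logs: bool,
    pub metrics: bool,
    pub traces: bool,
}

impl DatadogAgentFlags {
    /// Lists the data types whose filter would be attached to this source's server.
    pub fn enabled(&self) -> Vec<DataType> {
        let mut out = Vec::new();
        if self.logs {
            out.push(DataType::Logs);
        }
        if self.metrics {
            out.push(DataType::Metrics);
        }
        if self.traces {
            out.push(DataType::Traces);
        }
        out
    }
}
```

With all three flags set, a single source instance (and thus a single socket) serves every data type. Declaring several source instances, each with a different subset of flags, gives one socket per data type or group of data types.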
Datadog API key management would be the same as it is for Datadog logs & metrics.
Regarding APM stats, if we envision the `datadog_trace` sink as a universal sender for any kind of traces ingested by
Vector, it shall ultimately support computing APM stats, even if the stats payload is a bit complex
(it includes ddsketches) as this provides valuable stats on ingested traces. The Datadog OTLP traces exporter also
computes those stats. How Vector will handle APM stats is discussed in its own
RFC.
- Traces support is expected by users
- Local sampling is an interesting feature to lessen the amount of data sent to Datadog
- Using `LogEvent`s to represent traces implies that, until schemas are available, the format a trace sink would expect cannot be simply expressed and the sink will have to implement various sanity checks to ensure that received events are properly structured.
- Internal Rust traces can be converted into log events, but this is not reversible. This is still a good way of getting a text-based representation.
- Regarding internal traces representation, instead of reusing the `LogEvent` type, a new `Trace` concrete type could be added to the `Event` enum:
  - Either a specific implementation per vendor, allowing almost direct mapping from protocol definition into Rust struct(s).
  - Or a generic enough struct, most likely based on the OpenTelemetry trace format, possibly with additional fields to cover corner cases and/or metadata that may not be properly mapped into the OTLP trace structure. Overall, there is no huge discrepancy between Datadog traces and OpenTelemetry traces (the trace-agent already offers OTLP->Datadog conversion).
None.
- Write a subsequent RFC discussing how APM stats will fit in Vector.
- Introduce the new `Trace` variant in the `Event` enum.
- Submit a PR introducing traces support in the `datadog_agent` source, emitting a `LogEvent` for each trace and each APM event. It will re-organize the source to isolate generic code from data type specific code. APM stats will be dropped at this point.
- Submit a PR introducing the `datadog_trace` sink that turns relevant `LogEvent`s back into Datadog protobuf-encoded traces.
- Do the APM stats work.
- As soon as the schema feature is available, use it to express the expected trace format.
- Support for additional trace sources and sinks, probably OpenTelemetry first
- Profile support
- Ingest traces from Datadog tracing libraries directly
- OpenTelemetry exporter support (the Datadog exporter would probably be easily supported once this RFC has been implemented, as it uses the same Datadog endpoint as the trace-agent)
- Traces helpers in VRL
- Trace-agent configuration with a `vector.traces.url` & `vector.traces.enabled`