While Vector has the ability to decode and encode data from a variety of formats such as plaintext
or JSON, it currently lacks the ability to encode its internal event representation to a portable
format that can be used with existing sources and sinks other than `vector`. We propose adding new
codecs to allow just that.
- Standardized metric encoding
- Make Vector less opinionated: allow customization of the "shape" of expected/outputted events for "generic" sources/sinks
- Use Kafka as a buffer between two Vector instances (decoding Vector data)
- Send metrics data to Prometheus through Kafka
- Datadog codec
- Allow metrics to be sent and received with the `http` source and sink
- Decoders for sources
- This would overlap with the tailored Buffer Improvements work happening in 2021Q4, as native event encoding/decoding could also be used to emulate buffering between Vector processes.
- Allowing users to push data into Vector (via a standard source) that can be decoded natively into the internal Vector event representation.
- Handling logs, metrics, and eventually traces, both for decoding and encoding.
- Providing a simple schema that developers could reference if manually generating payloads outside of Vector itself.
- Implementing it for a specific source/sink pair.
- Forward/backwards compatibility guarantees for anything except for Protocol Buffers.
- Handling arbitrary native formats in differing sources/sinks, e.g. letting a `statsd` source parse the Prometheus exposition format, or having the Kafka sink emit metrics in the Influx line protocol.
- Versioned schemas for formats that are not already implicitly versionable. (i.e. no versioned JSON schema)
Users routinely use Vector as a unifying step in their observability pipeline: taking disparate sources and transforming, filtering, and cleaning up that data before sending it off to downstream systems. This means that, oftentimes, Vector may not support the type of data they want to send, and there's a required step of adapting their data for use with Vector. This isn't a problem that can be entirely solved, but it is currently harder for users to work around than it should be.
Sources like `exec` were born out of a desire to let users arbitrarily feed data into Vector from a
simple shell script or process, which itself could trivially pull and generate whatever data was
desired. However, there are still limitations, as users must perform subsequent transformation
steps to extract metrics from log lines, and so on.
Additionally, users are constrained when they want to send data from one Vector instance to another
by Vector only supporting this via the native gRPC-based `vector` source and sink. If users already
had a blessed solution for service-to-service data flow, such as Kafka, they would be stuck using
the aforementioned transformation steps to convert between the available encoded formats and the
desired event types within Vector.
Users would be able to specify two new encoding types for supported sources and sinks, called
`vector_native` and `vector_json`, that would encode and decode data natively to and from the
internal Vector `Event` type, using Protocol Buffers and JSON, respectively.
The `vector_native` codec uses Protocol Buffers and mirrors the codec used by the `vector` source
and sink. This codec follows our public Protocol Buffers definition in the repository, and is
treated as a tier-one schema: we commit to not updating the protocol in backwards- or
forwards-incompatible ways.
The `vector_json` codec uses JSON and would generally mirror the internal structure of a Vector
event flowing through the system. This codec has minimal support for versioning, and is subject to
change as the internal representation of Vector events evolves over time. A human-readable schema
would be generated as part of builds/releases, and would be mentioned in upgrade guides when there
is a breaking change, but we would generally only commit to interoperability between Vector
instances running the same version.
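To make that shape more concrete, here is a minimal sketch using hypothetical, heavily simplified
stand-ins for Vector's internal event types; the real `Event` and `LogEvent` types are richer, and
the exact JSON layout would be whatever `serde` produces for them at a given Vector version.

```rust
use std::collections::BTreeMap;

use serde::{Deserialize, Serialize};

// Hypothetical, simplified stand-ins for Vector's internal event types, shown
// only to illustrate what a `vector_json` payload could roughly look like.
#[derive(Debug, Default)]
struct EventMetadata; // finalizers, Datadog API key override, etc. -- never serialized

#[derive(Debug, Serialize, Deserialize)]
struct LogEvent {
    fields: BTreeMap<String, serde_json::Value>,
    #[serde(skip)]
    metadata: EventMetadata,
}

#[derive(Debug, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
enum Event {
    Log(LogEvent),
    // Metric(...) and, eventually, Trace(...) variants would live here as well.
}

fn main() -> serde_json::Result<()> {
    let mut fields = BTreeMap::new();
    fields.insert("message".to_string(), serde_json::json!("hello world"));
    fields.insert("host".to_string(), serde_json::json!("web-01"));

    let event = Event::Log(LogEvent { fields, metadata: EventMetadata });

    // Prints something like: {"log":{"fields":{"host":"web-01","message":"hello world"}}}
    println!("{}", serde_json::to_string(&event)?);
    Ok(())
}
```

The important property is that the payload is simply `serde`'s view of the event, so any external
system that can produce matching JSON can feed events directly into Vector.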
- The `vector_native` codec would be based on the same exact Protocol Buffers definition we use for the `vector` source and sink.
- The `vector_json` codec would be based on using `serde` to serialize `Event` to JSON.
  - Both `LogEvent` and `Metric` derive a `serde::Deserialize` implementation already, while `Metric` also derives a `serde::Serialize` implementation.
  - We would add the missing `serde` derives to `LogEvent` and `Event` itself, allowing top-level `Event`s to be trivially serialized and deserialized.
  - We would not serialize/deserialize event metadata, which currently only includes event finalizers and a Datadog API key override field.
    - Event metadata may come into scope in the future when there is a more generalized mechanism for adding metadata to events, but we would need to design a mechanism to distinguish "internal" metadata from "external" metadata, as we would not want to push API keys in plaintext, etc.
- The existing framing/codec work happening for both sources and sinks would gain two new implementations, for `vector_native` and `vector_json` respectively.
- We would use `serde-reflection` to generate a basic schema of `Event`, which could be stored in the source code itself, similar in principle to `Cargo.lock`. This would serve as the minimum viable schema for JSON use cases, without any commitment to versioning or backwards/forwards-compatibility. (A sketch of this schema generation follows this list.)
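As a rough illustration of that schema generation, the sketch below runs `serde-reflection` over
any `Deserialize`-implementing type and renders the resulting registry as YAML. The function name,
the choice of YAML, and the use of `trace_simple_type` are assumptions made for illustration;
tracing the real `Event` type may require feeding sample values instead if any of its
(de)serialization is customized.

```rust
use serde::de::DeserializeOwned;
use serde_reflection::{Tracer, TracerConfig};

// Sketch: derive a minimal schema for a `Deserialize`-implementing type and
// render it as YAML, so the result can be checked into the repository in a
// manner similar to `Cargo.lock`.
fn generate_schema<T: DeserializeOwned>() -> Result<String, Box<dyn std::error::Error>> {
    let mut tracer = Tracer::new(TracerConfig::default());
    tracer.trace_simple_type::<T>()?;

    // The registry maps container names (structs and enums) to their formats.
    let registry = tracer.registry()?;
    Ok(serde_yaml::to_string(&registry)?)
}
```

A subcommand on the Vector binary (name to be decided) could call something like
`generate_schema::<Event>()` and either print the result or compare it against the committed copy.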
Adding encodings for natively representing events would provide an additional avenue for users to both ingest data into Vector and construct more complex Vector deployment topologies. As Vector development can often be bottlenecked when it comes to adding new sources and sinks, this work would act as a force multiplier: users could invest a small amount of time converting their data to the native format and then ingest it universally.
If we didn't do this, it would not necessarily hurt the long-term goals of Vector, but it would require more effort over time to develop new sources and sinks to meet the demands of users who wish to use Vector with systems we don't already support, which could hurt the long-term success of Vector.
Encoding `Event` natively via Protocol Buffers should be a feature we can accomplish with no
additional burden on the Vector team, as we already perform the necessary due diligence and spend
time ensuring that our Protocol Buffers definition stays backwards/forwards-compatible.
Encoding `Event` to JSON, however, could result in more time spent by the Vector team on support
due to the lack of a stringent schema, given that we would not be transforming `Event` to a known
definition like Protocol Buffers. While generating a minimal viable schema could hopefully
alleviate some of those concerns, it still doesn't address the lack of versioned JSON schemas or
backwards/forwards-compatibility guarantees, which would likely represent the bulk of issues
brought to us by users utilizing `vector_json`.
Most of the relevant prior art would be related to our existing usage of Protocol Buffers for
Vector-to-Vector communication via the `vector` source and sink.
Additionally, there is an existing draft standard, JSON Schema, for describing the schema of JSON
documents. This would be a more robust schema to provide users compared to what `serde-reflection`
can generate. However, it still does not deal with versioning of the schema, or make it any easier
to make backwards/forwards-compatible changes to the schema.
At a higher level, of the typical alternatives to Vector, Cribl appears to be the only solution where their TCP JSON source can accept an arbitrary JSON payload that allows setting what Cribl calls "internal" fields. However, these fields are "used only within Cribl LogStream, and are not passed down to Destinations"[1] and so this does not appear to be a generic solution comparable to what this RFC proposes.
We could continue to push the `vector` source and sink as the supported method of handling
Vector-to-Vector communication. In practice, users seem to be fine with this approach, and it is
the basis of our Kubernetes-based aggregator deployment pattern. Practically speaking, any
environment which uses another technology as its primary mechanism for service-to-service data
flow could technically open up the communication paths needed for the `vector` source and sink to
be used. However, this does not address the potential desire for Vector to fit into an existing
infrastructure, rather than the other way around.
Additionally, we could provide more specific codecs designed around particular protocols, such as
the Prometheus exposition format or the Influx line protocol, and allow those to be configured at
the source and sink level. This would require users to add additional source/sink pipelines to
their configurations to handle those event types specifically. This could mean dedicated Kafka
topics, HTTP endpoints, and so on, depending on the event type/codec chosen, instead of the
"universal" format provided by shipping `Event`s natively.
- Is there a better format than JSON that we could/should use as the human-readable variant?
Incremental steps to execute this change. These will be converted to issues after the RFC is approved:
- Implement the remaining `serde` derives on `Event` and `LogEvent`, including event metadata exclusion.
- Add support for both `vector_native` and `vector_json` to the existing framing/decoding infrastructure used for sources.
- Add support for both `vector_native` and `vector_json` to `StandardEncodings` to provide the sink-side support, or to the framing/encoding infrastructure if it supports sinks by then.
- Use `serde-reflection` to generate a minimum viable schema definition that can be added to the repository, potentially as a Vector subcommand so Vector binaries can be self-documenting.
- Add a CI step that runs the same Vector subcommand (or whatever approach we use for running `serde-reflection`) and compares its output to what's currently in the repository, to ensure we don't let the schema get out of sync with the actual `serde` output. (A sketch of such a check follows below.)
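As a sketch of that CI check, and assuming the hypothetical `generate_schema` helper and schema
file path from the earlier sketches, a plain Rust test run in CI would be enough to fail the build
whenever the generated schema and the committed copy diverge:

```rust
// Hypothetical regression test: regenerate the reflected schema and compare it
// against the copy committed to the repository. The file path and the
// `generate_schema`/`Event` names are assumptions carried over from the
// earlier sketches, not actual Vector identifiers.
#[test]
fn event_schema_matches_committed_copy() {
    let generated = generate_schema::<Event>().expect("schema generation should succeed");
    let committed = std::fs::read_to_string("lib/vector-core/event.schema.yaml")
        .expect("committed schema file should exist");

    assert_eq!(
        generated.trim(),
        committed.trim(),
        "the Event schema has drifted; regenerate it and commit the result"
    );
}
```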