Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add avro to well-known encodings #993

Closed
wants to merge 5 commits into from
Closed

Add avro to well-known encodings #993

wants to merge 5 commits into from

Conversation

defunctzombie
Copy link
Contributor

@defunctzombie defunctzombie commented Oct 22, 2023

Add avro as a well-known schema encoding and message encoding. When using the "avro" message encoding, mcap writers indicate the messages on a channel are encoded using the avro binary serialization format. When using "avro" schema encoding, the schema format is a valid avro schema declaration as a JSON string.

A few notes/discussion points:

  • In the avro spec avro schema declarations can be one of an avro type, a single object, or an array. However, in the proposed mcap spec I have changed the allowed schema declaration and how they are interpreted. The mcap spec will only allow a single object or an array. The array form is not treated as a union and instead the "name" from the schema record references which record declaration from the array is the schema for the messages on channels using the schema. This is different to avro files typically treating this kind of schema declaration as a union type. This is because avro files do not have the concept of channels whereas mcap files do and writers typically write separate schemas on separate channels and topics.
  • Avro has two encoding formats: binary and json. I've opted to have "avro" message encoding refer to the binary format. If we wanted to support both we would need to consider how to disambiguate.

This change also adds a rust example that creates an mcap file with avro.

Related studio PR: https://github.com/foxglove/studio/pull/7008

- Add a rust writer example
@defunctzombie defunctzombie marked this pull request as ready for review October 23, 2023 02:56
@defunctzombie defunctzombie changed the title Add avro to spec Add avro to well-known encodings Oct 23, 2023

> Further, a name must be defined before it is used (“before” in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come “before” the messages attribute.)

You can define a name inline using a single schema object for `data` or an array of schema objects. If the `data` is an array of schemas, the `name` must reference a single
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this an additional restriction on top of the avro spec, or just a helpful note for people trying to follow the spec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say its a half-restriction. Avro broadly can mean any of the IDL, the message serialization format, and the container format (avro files). The Avro container format does not have "channels" or "topics" and only supports writing messages that conform to the schema defined in the header. Avro only allows one schema in the header but supports "union" types (similar to protobuf anyof). In avro headers, an array of schema objects is treated as a union type. So a user can write a message to the avro file for any of the schemas in the array.

Typical MCAP use has different semantics because we have channels which can reference specific schemas. So this note is about allowing a user to specify an array of schema objects (so one avro schema object in the array can reference another schema by name without re-defining it - i.e. Point2d, etc) but we need the "name" in the mcap schema record to tell us which of the schemas in the array is the schema that messages are serialized with in that channel. Thus we don't treat an array of schemas as a union like avro does.

### avro

- `name`: Fully qualified name of the record type (including namespace), e.g. `example.MyRecord`
- `encoding`: `avro`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider avro1, in case avro ever releases a version 2?

@james-rms
Copy link
Collaborator

I think we should support all valid avro values in an MCAP message unless we have a really good reason not to.
This means accepting all 3 possible toplevel schemas:

A Schema is represented in JSON by one of:

  • A JSON string, naming a defined type.
  • A JSON object, of the form:
    {"type": "typeName" ...attributes...}
    where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.
  • A JSON array, representing a union of embedded types.

I propose for names:

json type mcap schema name
string Starting from an empty set of definitions, the only defined types are the primitive types, so one of null, boolean,int, long, float, double,bytes, string
record The full type name as you've laid out in your PR
array a comma-separated list of names for each type in the union in the same order. I'd also accept a JSON array of strings. Note that unions may not contain unions per the avro spec, so we don't need to support nesting.

My reason for pushing back on this is that MCAP channels are logical channels representing a stream of data, and requiring writers to separate their unions into separate channels may remove some information they'd otherwise want to keep. My go-to example here is: messages in one channel in a recording are generally understood to have been sent in that order and arrive in that order. Messages in separate channels arrived at the recorder in log time order, but may have arrived at other consumers in a different order, and may have been sent in a different order.

I also think it's powerful to be able to project any field of a message into a new MCAP data stream. If only certain Avro types can be toplevel message types, that rules that out for Avro encoding at least.
I know we support other encodings that don't support this - for example, you can't have a ros1msg string in an MCAP message, you need to wrap it in a std_msgs/msg/String struct first. However, for those ecosystems there are well-known struct wrappers for primitive types, for Avro that's not the way.

@defunctzombie
Copy link
Contributor Author

Closing this for now. Good feedback and learnings but the primary customer for this is ok with their current setup and does not need anything urgent here. Until we get an avro user to review and provide feedback I am more comfortable shelving this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants