Add avro to well-known encodings #993

defunctzombie · 2023-10-22T04:21:09Z

Add avro as a well-known schema encoding and message encoding. When using the "avro" message encoding, mcap writers indicate the messages on a channel are encoded using the avro binary serialization format. When using "avro" schema encoding, the schema format is a valid avro schema declaration as a JSON string.

A few notes/discussion points:

- In the avro spec, a schema declaration can be one of an avro type name (a JSON string), a single object, or an array. In the proposed mcap spec I have narrowed the allowed declarations and changed how they are interpreted: only a single object or an array is allowed. The array form is not treated as a union. Instead, the "name" from the schema record identifies which record declaration in the array is the schema for messages on channels using that schema. This differs from avro files, which typically treat this kind of declaration as a union type. The reason is that avro files have no concept of channels, whereas mcap files do, and writers typically write separate schemas on separate channels and topics.
- Avro has two encoding formats: binary and JSON. I've opted to have the "avro" message encoding refer to the binary format. If we wanted to support both, we would need to consider how to disambiguate.
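
To make the binary/JSON distinction concrete, here is a minimal Python sketch that hand-encodes one record both ways. It assumes a hypothetical record with a `long` field `id` and a `string` field `label`; the helper names are illustrative and are not part of any mcap or avro library. It relies on the Avro specification's rules that a binary `long` is a zig-zag varint, a binary `string` is a length-prefixed UTF-8 byte sequence, and record fields are concatenated in declaration order.

```python
import json


def encode_long(n: int) -> bytes:
    """Avro binary long: zig-zag encode, then base-128 varint (64-bit range assumed)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro binary string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record_binary(rec: dict) -> bytes:
    """Binary form of the hypothetical record: fields in declaration order, no field names."""
    return encode_long(rec["id"]) + encode_string(rec["label"])


def encode_record_json(rec: dict) -> bytes:
    """JSON form of the same record: a JSON object keyed by field name."""
    return json.dumps(rec, separators=(",", ":")).encode("utf-8")


record = {"id": 3, "label": "ab"}
binary_payload = encode_record_binary(record)  # 4 bytes
json_payload = encode_record_json(record)      # 21 bytes
```

Neither payload carries a marker saying which format it is in, so a reader would need some extra signal (for example a distinct message encoding name) to tell the two apart.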

This change also adds a rust example that creates an mcap file with avro.

Related studio PR: https://github.com/foxglove/studio/pull/7008

- Add a rust writer example

james-rms · 2023-10-23T03:07:11Z

website/docs/spec/registry.md

+
+> Further, a name must be defined before it is used (“before” in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come “before” the messages attribute.)
+
+You can define a name inline using a single schema object for `data` or an array of schema objects. If the `data` is an array of schemas, the `name` must reference a single


Is this an additional restriction on top of the avro spec, or just a helpful note for people trying to follow the spec?

defunctzombie · 2023-10-24

I would say it's a half-restriction. "Avro" can broadly mean any of the IDL, the message serialization format, and the container format (avro files). The Avro container format does not have "channels" or "topics" and only supports writing messages that conform to the schema defined in the header. Avro allows only one schema in the header but supports "union" types (similar to protobuf `oneof`). In avro headers, an array of schema objects is treated as a union type, so a user can write a message to the avro file for any of the schemas in the array.

Typical MCAP use has different semantics because channels can reference specific schemas. This note is about allowing a user to specify an array of schema objects, so that one avro schema object in the array can reference another by name without re-defining it (e.g. Point2d). We then need the "name" in the mcap schema record to tell us which of the schemas in the array is the one that messages in that channel are serialized with. Thus we don't treat an array of schemas as a union like avro does.
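
As a rough illustration of the lookup described above, the following Python sketch picks the channel's schema out of an array of declarations by matching the mcap schema record's `name`. The function names and the `example.Point2d` / `example.Path` records are hypothetical, and the full-name handling is simplified: it only combines a declaration's own `namespace` and `name`, ignoring namespace inheritance from enclosing types.

```python
import json


def full_name(decl: dict) -> str:
    """Simplified avro full name: a dotted name is already full, else prefix the namespace."""
    name = decl["name"]
    namespace = decl.get("namespace")
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def select_schema(schema_name: str, data: bytes) -> dict:
    """Return the one declaration in `data` whose full name equals `schema_name`."""
    parsed = json.loads(data)
    # A single object is treated like a one-element array.
    declarations = parsed if isinstance(parsed, list) else [parsed]
    matches = [
        d for d in declarations
        if isinstance(d, dict) and "name" in d and full_name(d) == schema_name
    ]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one declaration named {schema_name!r}")
    return matches[0]


# Hypothetical schema record data: Path reuses Point2d by name.
schema_data = json.dumps([
    {"type": "record", "name": "Point2d", "namespace": "example",
     "fields": [{"name": "x", "type": "double"}, {"name": "y", "type": "double"}]},
    {"type": "record", "name": "Path", "namespace": "example",
     "fields": [{"name": "points",
                 "type": {"type": "array", "items": "example.Point2d"}}]},
]).encode("utf-8")

selected = select_schema("example.Path", schema_data)
```

A real reader would still hand the whole array to an avro parser so that `example.Point2d` resolves; the lookup only decides which declaration is the message type for the channel.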

james-rms · 2023-10-23T03:08:21Z

website/docs/spec/registry.md

+### avro
+
+- `name`: Fully qualified name of the record type (including namespace), e.g. `example.MyRecord`
+- `encoding`: `avro`


consider avro1, in case avro ever releases a version 2?

james-rms · 2023-10-24T22:06:27Z

I think we should support all valid avro values in an MCAP message unless we have a really good reason not to.
This means accepting all 3 possible top-level schema forms:

> A Schema is represented in JSON by one of:
>
> - A JSON string, naming a defined type.
> - A JSON object, of the form: `{"type": "typeName" ...attributes...}` where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.
> - A JSON array, representing a union of embedded types.

I propose for names:

| json type | mcap schema name |
| --- | --- |
| string | Starting from an empty set of definitions, the only defined types are the primitive types, so one of `null`, `boolean`, `int`, `long`, `float`, `double`, `bytes`, `string` |
| record | The full type name as you've laid out in your PR |
| array | A comma-separated list of names for each type in the union, in the same order. I'd also accept a JSON array of strings. Note that unions may not contain unions per the avro spec, so we don't need to support nesting. |
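
One possible reading of this naming proposal, sketched in Python. The function `proposed_schema_name` is hypothetical and only covers the three rows of the proposal; how to name unnamed complex types such as a top-level `{"type": "array", ...}` is not addressed there, so the sketch rejects them.

```python
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def full_name(decl: dict) -> str:
    """Simplified avro full name: a dotted name is already full, else prefix the namespace."""
    name = decl["name"]
    namespace = decl.get("namespace")
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def proposed_schema_name(schema) -> str:
    """Derive an mcap schema name from a parsed top-level avro schema (assumed mapping)."""
    if isinstance(schema, str):
        # With no prior definitions, a bare string can only name a primitive.
        if schema not in PRIMITIVES:
            raise ValueError(f"{schema!r} is not a primitive type name")
        return schema
    if isinstance(schema, dict):
        if "name" not in schema:
            raise ValueError("unnamed complex types are not covered by the proposal")
        return full_name(schema)
    if isinstance(schema, list):
        # Unions may not immediately contain other unions, so no nesting is needed.
        if any(isinstance(branch, list) for branch in schema):
            raise ValueError("a union may not directly contain another union")
        return ",".join(proposed_schema_name(branch) for branch in schema)
    raise TypeError(f"unsupported schema value: {type(schema).__name__}")


record = {"type": "record", "name": "MyRecord", "namespace": "example", "fields": []}
union_name = proposed_schema_name(["null", record])
```

The comma-separated form keeps the branch order, which matters because the avro binary encoding of a union value starts with the zero-based index of the branch.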

My reason for pushing back on this is that MCAP channels are logical channels representing a stream of data, and requiring writers to split their unions across separate channels may remove information they'd otherwise want to keep. My go-to example: messages in one channel in a recording are generally understood to have been sent in that order and to have arrived in that order. Messages in separate channels arrived at the recorder in log time order, but may have arrived at other consumers in a different order, and may have been sent in a different order.

I also think it's powerful to be able to project any field of a message into a new MCAP data stream. If only certain Avro types can be top-level message types, that rules this out, for Avro encoding at least.
I know we support other encodings that don't allow this. For example, you can't have a ros1msg string in an MCAP message; you need to wrap it in a std_msgs/msg/String struct first. However, those ecosystems have well-known struct wrappers for primitive types, and Avro does not.

defunctzombie · 2024-01-24T01:12:30Z

Closing this for now. Good feedback and learnings but the primary customer for this is ok with their current setup and does not need anything urgent here. Until we get an avro user to review and provide feedback I am more comfortable shelving this.

Add avro to spec

6696178

- Add a rust writer example

defunctzombie added 2 commits October 21, 2023 21:26

update schema

81285b7

update

d330fc2

add note on arrays of schemas

12c0802

defunctzombie marked this pull request as ready for review October 23, 2023 02:56

defunctzombie requested review from jtbandes, wkalt and james-rms October 23, 2023 02:56

defunctzombie changed the title from "Add avro to spec" to "Add avro to well-known encodings" Oct 23, 2023

james-rms reviewed Oct 23, 2023

Update website/docs/spec/registry.md

c99237a

Co-authored-by: james-rms

defunctzombie closed this Jan 24, 2024
