feat: version 2 catalog serialization #26183

Merged
merged 5 commits into from
Mar 24, 2025

hiltontj
Contributor

@hiltontj hiltontj commented Mar 21, 2025

This introduces a new version for the catalog file formats (snapshot files and log files). The reason for introducing a new version is to change the serialization/deserialization format from bitcode to JSON. See #26180.

The approach taken was to copy the existing type definitions for both log and snapshot files into two places: a v1 module and a v2 module. Going forward:

  • Types defined in v1 should not be changed. They are only there to enable deserialization of existing bitcode-serialized catalog files.
  • Types defined in v2 can be modified in a backward-compatible manner, and new types can be added to the v2 modules.
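
As a rough illustration of the v1/v2 split described above, the sketch below freezes one type in a `v1` module, lets the `v2` copy gain an optional field, and converts old values with `From`. The type and field names (`TableSnapshot`, `description`) are hypothetical and are not taken from this PR; the real types also carry serialization derives that are omitted here.

```rust
// Hypothetical sketch: names and fields are illustrative, not from the PR.

/// Frozen definitions, kept only to decode files written before the format change.
pub mod v1 {
    #[derive(Debug, Clone, PartialEq)]
    pub struct TableSnapshot {
        pub table_id: u32,
        pub table_name: String,
    }
}

/// Current definitions, which may evolve in a backward-compatible way.
pub mod v2 {
    #[derive(Debug, Clone, PartialEq)]
    pub struct TableSnapshot {
        pub table_id: u32,
        pub table_name: String,
        /// Added after v1; optional so that data without it still maps cleanly.
        pub description: Option<String>,
    }
}

/// Old values are upgraded once, right after they are decoded.
impl From<v1::TableSnapshot> for v2::TableSnapshot {
    fn from(old: v1::TableSnapshot) -> Self {
        Self {
            table_id: old.table_id,
            table_name: old.table_name,
            // v1 has no such field, so the upgraded value leaves it unset.
            description: None,
        }
    }
}
```

With a layout like this, the rest of the codebase only ever sees `v2` values, and the `v1` module is touched solely by the load path.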

With this PR, old files are not overwritten. The server does not migrate any files on startup. See #26183 (comment)

Closes #26180

Changed the catalog serialization format from bitcode to JSON. This applies to
both the log files and the snapshots.

This required copying existing code into two places:
- v1
- latest

The types from latest are used as the "in-memory" types that we can work
with throughout the codebase. The v1 types are kept only for posterity
and to allow deserializing catalog files that predate this
change; they are not meant to be used throughout the code.
@hiltontj hiltontj added the v3 label Mar 21, 2025
@hiltontj hiltontj self-assigned this Mar 21, 2025
Contributor

@praveen-influx praveen-influx left a comment

Looks good, just a question

I think I get the idea of introducing the latest version, but it is not clear when moving from existing V1 version to latest is supposed to happen.

Since catalog logs aren't written periodically (this is my understanding, I could be wrong), will that mean unless there's a change to catalog itself the user won't be migrated to json (leaving them to use bitcode?). If that's the intention, that is fine but wanted to check if I'm missing something in this PR itself.

@hiltontj
Contributor Author

I think I get the idea of introducing the latest version, but it is not clear when moving from existing V1 version to latest is supposed to happen.

The idea is that existing files written with v1 are still loaded, but any new files are written with v2 (maybe I should be explicit about the module name and call it v2, not latest?). So, it doesn't force an update of existing files to the new version.
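
A minimal sketch of that load-old/write-new behaviour, under stated assumptions: the 4-byte version tags, the `CatalogLog` type, and the toy payload encodings (8 big-endian bytes standing in for bitcode, decimal text standing in for JSON) are all hypothetical and only illustrate the dispatch, not the actual file format.

```rust
// Hypothetical sketch: tags, types, and encodings are stand-ins, not the real format.

/// In-memory form used by the rest of the code, regardless of file version.
#[derive(Debug, PartialEq)]
pub struct CatalogLog {
    pub sequence: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileVersion {
    V1,
    V2,
}

#[derive(Debug, PartialEq)]
pub enum LoadError {
    UnknownVersion,
    Malformed,
}

// Assumed version tags at the start of each file.
const V1_TAG: &[u8] = b"v1__";
const V2_TAG: &[u8] = b"v2__";

/// Split a file into its version and the remaining payload.
pub fn detect_version(bytes: &[u8]) -> Result<(FileVersion, &[u8]), LoadError> {
    if let Some(rest) = bytes.strip_prefix(V1_TAG) {
        Ok((FileVersion::V1, rest))
    } else if let Some(rest) = bytes.strip_prefix(V2_TAG) {
        Ok((FileVersion::V2, rest))
    } else {
        Err(LoadError::UnknownVersion)
    }
}

/// Stand-in for the legacy binary decoder: exactly 8 big-endian bytes.
fn decode_v1(payload: &[u8]) -> Result<CatalogLog, LoadError> {
    if payload.len() != 8 {
        return Err(LoadError::Malformed);
    }
    let mut raw = [0u8; 8];
    raw.copy_from_slice(payload);
    Ok(CatalogLog { sequence: u64::from_be_bytes(raw) })
}

/// Stand-in for the text-based decoder: a decimal number.
fn decode_v2(payload: &[u8]) -> Result<CatalogLog, LoadError> {
    std::str::from_utf8(payload)
        .ok()
        .and_then(|s| s.trim().parse::<u64>().ok())
        .map(|sequence| CatalogLog { sequence })
        .ok_or(LoadError::Malformed)
}

/// Loading accepts either version and always yields the in-memory type.
pub fn load(bytes: &[u8]) -> Result<CatalogLog, LoadError> {
    let (version, payload) = detect_version(bytes)?;
    match version {
        FileVersion::V1 => decode_v1(payload),
        FileVersion::V2 => decode_v2(payload),
    }
}

/// Saving only ever produces the newest version; old files are left alone.
pub fn save(log: &CatalogLog) -> Vec<u8> {
    let mut out = V2_TAG.to_vec();
    out.extend_from_slice(log.sequence.to_string().as_bytes());
    out
}
```

Note that `save` has no v1 branch at all, which is what keeps existing files untouched while every new file lands in the new format.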

Since catalog logs aren't written periodically (this is my understanding, I could be wrong), will that mean unless there's a change to catalog itself the user won't be migrated to json (leaving them to use bitcode?).

Correct on both points. Existing bitcode v1 files will remain. Eventually, a catalog checkpoint file will be written in v2 (JSON), and all log files that precede it will never be loaded again.
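
To make the "never loaded again" point concrete, here is a small sketch. It assumes, hypothetically, that every log file has a sequence number and that a checkpoint taken at sequence N covers all logs up to and including N; the names `LogFile`, `Format`, and `logs_to_replay` are illustrative only.

```rust
// Hypothetical sketch: the sequence-number scheme and all names are assumptions.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    BitcodeV1,
    JsonV2,
}

#[derive(Debug, Clone, PartialEq)]
pub struct LogFile {
    pub sequence: u64,
    pub format: Format,
}

/// Return the log files that still have to be replayed on load.
/// With no checkpoint, every log is replayed; with a checkpoint at
/// `seq`, only logs written after it are needed.
pub fn logs_to_replay(checkpoint_sequence: Option<u64>, logs: &[LogFile]) -> Vec<&LogFile> {
    logs.iter()
        .filter(|log| match checkpoint_sequence {
            Some(seq) => log.sequence > seq,
            None => true,
        })
        .collect()
}
```

Once a v2 checkpoint exists past the last bitcode log, the filter never returns a `BitcodeV1` entry, so the v1 decoder is simply not exercised any more.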

We could force a checkpoint on startup, e.g., if files versioned before the latest get loaded, and that way, the old stuff should never need to be deserialized again.
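
This PR does not implement that, but the decision itself could be sketched roughly as follows. `CatalogVersion`, `LATEST`, and `should_force_checkpoint` are hypothetical names; the sketch only covers the check, not the checkpoint write.

```rust
// Hypothetical sketch of the startup check; not part of this PR.

/// Variant order matters: the derived `Ord` makes `V1 < V2`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum CatalogVersion {
    V1,
    V2,
}

pub const LATEST: CatalogVersion = CatalogVersion::V2;

/// True if any file loaded during startup was written in an older format,
/// in which case a fresh checkpoint would supersede it.
pub fn should_force_checkpoint(loaded: &[CatalogVersion]) -> bool {
    loaded.iter().any(|version| *version < LATEST)
}
```

A caller would collect the version of each file as it loads and consult this once, after startup loading has finished.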

Member

@pauldix pauldix left a comment

A bunch of replicated code, but I think it's ultimately cleaner to have the clearly delineated versions.

@hiltontj hiltontj marked this pull request as ready for review March 24, 2025 14:48
@praveen-influx
Contributor

maybe I should be explicit about the module name and call it v2, not latest?

It was mainly my unfamiliarity with when the log/checkpoint is serialized that tripped me up. It can stay as latest.

We could force a checkpoint on startup, e.g., if files versioned before the latest get loaded, and that way, the old stuff should never need to be deserialized again.

I'm not sure about the cycle, but if it loads old-version files from disk, then writes a checkpoint file (at some periodic interval) and deletes the old log files, it is just a matter of time before the old log files disappear anyway. In that case we probably don't need to force a snapshot.

@hiltontj hiltontj merged commit f5144b8 into main Mar 24, 2025
13 checks passed
@hiltontj hiltontj deleted the hiltontj/catalog-format branch March 24, 2025 16:41
Successfully merging this pull request may close these issues.

Use a different serialization format in catalog
3 participants