Variant: Rust API to Read Variant Values #7423
Comments
We'd like to pick this up. Can you assign this one to me, @alamb?
Thanks @mkarbo!
FYI @PinkCrow007
BTW there are a few known issues with the example variant values in parquet-testing (apache/parquet-testing#75).
I think they will be relatively easy to solve or work around for the time being, but I wanted to bring them to your attention.
@scovich had some great comments on #7452 (comment) that I wanted to copy/paste into this ticket perhaps for wider discussion:
I likewise think the "safe / return error by default" is the right model. I also agree it should be an API goal that there are no panics on invalid variant data (it should return an error instead).
Yeah I agree, thanks for pointing it out.
BTW I think @mapleFU has made a PR with a proposed C/C++ API for accessing variants which might have some additional inspiration:
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The first part of supporting the Variant type in Parquet and Arrow is
programmatic access to values encoded with the binary format described in
[VariantEncoding.md]. This ticket covers the API to read such values, but not
creating such values or representing them in Arrow or Parquet, which are
covered in other tickets.
Describe the solution you'd like
I would like a Rust API, similar to Json::Value and comparable APIs, to dynamically access variant values.
Here is some example binary data for testing:
Describe alternatives you've considered
I think a Rust enum approach with references would be a good model.
I suggest creating a new crate, arrow-variant, and marking it as experimental, saying it will contain breaking changes for the next several
releases (maybe we can even version it 0.1, etc).
For example:
Sketch of structures
Creating Variants from buffers
Working with Primitive Variants
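As a rough illustration of the enum-with-references idea, here is a minimal sketch (not the API proposed in this ticket). The names `Variant`, `VariantError`, and `try_new` are hypothetical. The header layout used here (basic type in the low 2 bits, type info in the upper 6 bits, little-endian integers, primitive IDs 0, 1, 2, 3, 5 for null, true, false, int8, int32) is an assumption based on one reading of VariantEncoding.md and should be checked against the spec. Only a handful of types are covered.

```rust
/// Hypothetical borrowed view of one variant value (illustrative subset only).
#[derive(Debug, PartialEq)]
pub enum Variant<'a> {
    Null,
    Bool(bool),
    Int8(i8),
    Int32(i32),
    /// Borrows the string bytes from the input buffer; nothing is copied.
    ShortString(&'a str),
}

/// Hypothetical error type: invalid input yields an error, never a panic.
#[derive(Debug, PartialEq)]
pub enum VariantError {
    Truncated,
    InvalidUtf8,
    Unsupported(u8),
}

impl<'a> Variant<'a> {
    /// Decodes the value at the start of `value`, validating as it goes.
    pub fn try_new(value: &'a [u8]) -> Result<Self, VariantError> {
        let (&header, rest) = value.split_first().ok_or(VariantError::Truncated)?;
        // Assumed layout: low 2 bits = basic type, upper 6 bits = type info.
        let basic_type = header & 0b11;
        let type_info = header >> 2;
        match (basic_type, type_info) {
            (0, 0) => Ok(Variant::Null),
            (0, 1) => Ok(Variant::Bool(true)),
            (0, 2) => Ok(Variant::Bool(false)),
            (0, 3) => {
                let b = rest.first().ok_or(VariantError::Truncated)?;
                Ok(Variant::Int8(*b as i8))
            }
            (0, 5) => {
                let b = rest.get(..4).ok_or(VariantError::Truncated)?;
                Ok(Variant::Int32(i32::from_le_bytes([b[0], b[1], b[2], b[3]])))
            }
            // Short string: the type info bits hold the byte length.
            (1, len) => {
                let bytes = rest.get(..len as usize).ok_or(VariantError::Truncated)?;
                std::str::from_utf8(bytes)
                    .map(Variant::ShortString)
                    .map_err(|_| VariantError::InvalidUtf8)
            }
            _ => Err(VariantError::Unsupported(header)),
        }
    }
}
```

Because `Variant<'a>` borrows from the input slice, decoding a short string does not copy or allocate, and truncated or malformed input surfaces as a `VariantError` rather than a panic.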
I personally suggest doing this over a few PRs:
Variant struct/enum, support a few basic variant primitive types
Additional context
Open Questions:
When should validation be done?
I do think there should be an API like:
However, the API sketched above proposes doing validation on access (when the
values are accessed). An alternate approach would be to validate everything on
creation and then use unchecked APIs during access.
I think validating once up front is better if most fields are accessed or certain
fields are read multiple times. For the use case where only some fields are read,
I think validating on access would be faster.
The spec also allows metadata to contain dictionary values that do not appear as
struct names in the variant value itself, so eager validation would potentially
verify string data unnecessarily.
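To make the trade-off concrete, here is a minimal sketch contrasting the two strategies. It uses a toy layout (repeated `[len: u8][UTF-8 bytes]` fields), which is NOT the Variant encoding, and all names (`LazyFields`, `ValidatedFields`, `FieldError`, `next_field`) are hypothetical.

```rust
#[derive(Debug, PartialEq)]
pub enum FieldError {
    Truncated,
    InvalidUtf8,
    OutOfRange,
}

/// Splits the next raw field off the front of `buf` (bounds check only).
fn next_field(buf: &[u8]) -> Result<(&[u8], &[u8]), FieldError> {
    let (&len, rest) = buf.split_first().ok_or(FieldError::OutOfRange)?;
    let len = len as usize;
    if rest.len() < len {
        return Err(FieldError::Truncated);
    }
    Ok(rest.split_at(len))
}

/// Validate on access: construction is free, each `get` is fallible.
pub struct LazyFields<'a>(&'a [u8]);

impl<'a> LazyFields<'a> {
    pub fn new(buf: &'a [u8]) -> Self {
        LazyFields(buf)
    }

    pub fn get(&self, index: usize) -> Result<&'a str, FieldError> {
        let mut rest = self.0;
        // Skipped fields are bounds-checked but their UTF-8 is never inspected.
        for _ in 0..index {
            rest = next_field(rest)?.1;
        }
        let (field, _) = next_field(rest)?;
        std::str::from_utf8(field).map_err(|_| FieldError::InvalidUtf8)
    }
}

/// Validate up front: construction checks every field, access cannot fail.
pub struct ValidatedFields<'a>(Vec<&'a str>);

impl<'a> ValidatedFields<'a> {
    pub fn try_new(buf: &'a [u8]) -> Result<Self, FieldError> {
        let mut fields = Vec::new();
        let mut rest = buf;
        while !rest.is_empty() {
            let (field, tail) = next_field(rest)?;
            let s = std::str::from_utf8(field).map_err(|_| FieldError::InvalidUtf8)?;
            fields.push(s);
            rest = tail;
        }
        Ok(ValidatedFields(fields))
    }

    pub fn get(&self, index: usize) -> Option<&'a str> {
        self.0.get(index).copied()
    }
}
```

With the lazy view, a buffer containing one malformed field can still serve the well-formed ones, and unread strings cost nothing. The eager view rejects the whole buffer at construction but makes repeated reads cheap.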
I suggest starting with an API that is fallible (aka creating a Variant or
accessing a field returns Result<Variant>). We can always add unsafe versions
of the APIs for use cases where validation overhead is significant (e.g.
UTF-8 validation for field names when writing JSON) and justified with benchmarks.
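A minimal sketch of what such a checked/unchecked pairing could look like, modeled on the standard library's `str::from_utf8` and `str::from_utf8_unchecked`. The `FieldName` type and its method names are hypothetical and only illustrate the shape of the API.

```rust
use std::str::Utf8Error;

/// Hypothetical wrapper around the raw bytes of one field name.
pub struct FieldName<'a> {
    bytes: &'a [u8],
}

impl<'a> FieldName<'a> {
    pub fn new(bytes: &'a [u8]) -> Self {
        FieldName { bytes }
    }

    /// Safe default: validates UTF-8 on each call and returns an error
    /// instead of panicking on invalid data.
    pub fn as_str(&self) -> Result<&'a str, Utf8Error> {
        std::str::from_utf8(self.bytes)
    }

    /// Opt-in fast path that skips validation.
    ///
    /// # Safety
    /// The caller must guarantee that the bytes are valid UTF-8, for example
    /// because the buffer was already validated once.
    pub unsafe fn as_str_unchecked(&self) -> &'a str {
        unsafe { std::str::from_utf8_unchecked(self.bytes) }
    }
}
```

Keeping the safe method as the default means callers only take on the `unsafe` contract when a benchmark shows the validation cost matters.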