Skip to content

Variant: Rust API to Read Variant Values #7423

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Tracked by #6736
alamb opened this issue Apr 18, 2025 · 8 comments · Fixed by #7535
Closed
Tracked by #6736

Variant: Rust API to Read Variant Values #7423

alamb opened this issue Apr 18, 2025 · 8 comments · Fixed by #7535
Assignees
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Apr 18, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The first part of supporting the Variant type in Parquet and Arrow is
programmatic access to values encoded with the binary format described in
[VariantEncoding.md]. This ticket covers the API to read such values, but not
creating such values, or representing it using arrow or parquet which are
covered in other tickets

Describe the solution you'd like
I would like a Rust API, similar to the Json::Value and similar APIs to dynamically access variant values.

Here is some example binary data for testing:

Describe alternatives you've considered

I think a Rust enum approach with references would be a good model.

I suggest creating a new crate, arrow-variant, and marking it as
experimental, etc saying it will contain breaking changes for the next several
releases (maybe we can even version it 0.1, etc)

For example:

Sketch of structures

/// Variant value. May contain references to metadata and value
/// 'a is lifetime for metadata
/// 'b is lifetime for value
pub enum Variant<'a, 'b> {
  Variant::Null,
  Variant::Int8
  ...
  // strings are stored in the value and thus have references to that value
  Variant::String(&'b str),
  Variant::ShortString(&'b str),
  // Objects and Arrays need the metadata and values, so store both.
  Variant::Object(VariantObject<'a, 'b>),
  VariantArray(VariantArray<'a, 'b>)
}

/// Wrapper over Variant Metadata
pub struct VariantMetadata<'a> {
  metadata: &'a[u8],
  // perhaps access to header fields like dict length and is_sorted
}

/// Represents a Variant Object with references to the underlying metadata
/// and value fields
pub enum VariantObject<'a, 'b> {
  // pointer to metadata
  metadata: VariantMetadata<'a>,
  // pointer to value
  value: &'a [u8],
}

Creating Variants from buffers

// Each variant has a metadata and value buffer:
let metadata: &[u8] = ...;
let value: &[u8] = ....;
// The Rust API should NOT require allocations or copy the metadata/values
let variant = Variant::try_new(metadata, value)?;

Working with Primitive Variants

// Act based on the type of variant
match variant {
  Variant::Int8(val) => println!("The value was int8: {val}"),
  ...
  Variant::SmallString(val) => println!("The value was a small string: {val}"),
  ...
  Variant::Object(object) => {
    println("The variant was in object. The fields are:");
    for (field_name, field_value) in object.fields()? {
      // The inner field value is also a variant
      match field_value {
        Variant::...
      }
    }
  }
  // similarly for Variant::Array
}

I personally suggest doing this over a few PRs:

  1. Scaffolding: Variant struct/enum, support a few basic variant primtive types
  2. Basic nested type support: basic support for objects
  3. Array support: support for arrays
  4. Complete APIs, etc

Additional context

Open Questions:When should validation be done?

I do think there should be an API like:

/// ensure that metadata and value are valid according to the Variant spec, returns error if not.
Variant::validate(&metadata, &value)?;

However, the API sketched above proposes doing validation on access (when the
values are accessed). An alternate approach would be to validate everything on
creation and then use unchecked APIs during access.

I think validating once upfront is better if most fields are accessed or certain
fields are read multiple times. For the usecase where only some fields are read
I think verifying on access would be faster.

The spec also allows metadata to contain dictionary values that do not appear as
struct names in the variant value itself, so eager validation would potentially
verify string data uncessairly.

I suggest starting with an API that is fallible (aka creating a Variant or
accessing a field returns Result<Variant>. We can always add unsafe versions
of the APIs for usecases where validation overhead is significant (e.g. writing
utf8 validation for field names when writing json), and justified with benchmarks

@mkarbo
Copy link
Contributor

mkarbo commented May 19, 2025

We'd like to pick this up, can you assign this one to me @alamb?

@alamb
Copy link
Contributor Author

alamb commented May 19, 2025

Thanks @mkarbo !

@alamb
Copy link
Contributor Author

alamb commented May 19, 2025

FYI @PinkCrow007

@alamb
Copy link
Contributor Author

alamb commented May 19, 2025

BTW there are a few known issues with the example variant values apache/parquet-testing#75 in parquet-testing:

Specifically

I think they will be relatively easy to solve / workaround for the time being, but I wanted to bring them to your attention

@alamb
Copy link
Contributor Author

alamb commented May 20, 2025

@scovich had some great comments on #7452 (comment) that I wanted to copy/paste into this ticket perhaps for wider discussion:

I have not reviewed the code carefully at all yet, and what follows is a general observation based on the inherent nature of variant data and rust notions of safety:

It will be really tempting to have "efficient" code that e.g. uses from_utf8_unchecked to extract a &str from a &[u8], or to use indexing operations like v[10] to extract bytes. But variant data is generally untrusted user input and whatever Variant struct/enum we define will become the first -- and often only -- line of defense against malicious or malformed input.

Hopefully we can code carefully, with the goal that sizes and/or contents of metadata and value slices will never cause a panic?

Additinoally, it seems like we have a few choices for values such as strings and decimals even a right-sized byte slice can contain invalid values:

  1. Return obviously unvalidated values, e.g. &[u8] instead of &str for strings, and &[u8] instead of whatever VariantDecimal struct we might otherwise define -- leaving the user responsible to finish the conversion as (un)safely as they deem prudent.
  2. Return ostensibly validated values, with (safe) checked and (unsafe) unchecked constructors and/or getters that let the user choose the one they deem appropriate.

I personally favor the latter approach (safe and easy to use, even if not always the absolutely max efficient), but the topic probably needs a wider discussion.

@alamb
Copy link
Contributor Author

alamb commented May 20, 2025

I personally favor the latter approach (safe and easy to use, even if not always the absolutely max efficient), but the topic probably needs a wider discussion.

I likewise think the "safe / return error by default" is the right model. I also agree it should be an API goal that there should be no panics on invalid variant data (it should return an error instead)

@mkarbo
Copy link
Contributor

mkarbo commented May 20, 2025

Yeah I agree, thanks for pointing it out

@alamb
Copy link
Contributor Author

alamb commented May 21, 2025

BTW I think @mapleFU has made a PR with a proposed C/C++ API for accessing variants which might have some additional inspiration:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants