-
Notifications
You must be signed in to change notification settings - Fork 804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read nested Parquet 2-level lists correctly #6757
Conversation
parquet/src/schema/types.rs
Outdated
|
||
/// Returns `true` if this type is annotated as a list. | ||
pub fn is_list(&self) -> bool { | ||
match self.is_group() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since .is_group()
returns a boolean wouldn't it be better semantics to put it inside a branch conditional just using if
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Rust has warped me...I feel like I'm doing it wrong if I use if
😅. But you're right that it would be more succinct.
use arrow::datatypes::ToByteSlice; | ||
|
||
let testdata = arrow::util::test_util::parquet_test_data(); | ||
// message my_record { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thank you for putting a comment with the test data. I like this :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix CI 👍
I wish I could 😅. It appears that test is borked for everyone due to some upstream shenanigans. And now fixed by #6745 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, mostly just some minor nits
|| repeated_field.name() == "array" | ||
|| repeated_field.name() == format!("{}_tuple", list_type.name()) | ||
|| (!repeated_field.is_list() | ||
&& !repeated_field.has_single_repeated_child() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&& !repeated_field.has_single_repeated_child() | |
&& items[0].get_basic_info().repetition() != Repetition::REPEATED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't do this because repetition()
can theoretically panic, even though by this point in the schema every node should have a repetition. Perhaps panicking here is warranted if the schema is invalid. I'll change if you'd prefer the panic (or return an error à la #6738).
let items = repeated_field.get_fields(); | ||
if items.len() != 1 | ||
|| repeated_field.name() == "array" | ||
|| repeated_field.name() == format!("{}_tuple", list_type.name()) | ||
|| (!repeated_field.is_list() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this check necessary, given we're in visit_list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're testing if the child of the list we're processing is also LIST
annotated.
optional group my_list (LIST) { <---- this is `list_type`
repeated group array (LIST) { <---- this is `repeated_field`
repeated int32 array;
};
}
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Looks like an improvement to me -- thank you @etseidl , @tustvold and @devanbenz |
Which issue does this PR close?
Closes #6756.
Rationale for this change
See issue.
What changes are included in this PR?
Modifies both the arrow and record readers to check for nested lists before triggering the rule that repeated groups named "array" are treated as
list<OneTuple>
.Are there any user-facing changes?
Changes the interpretation of some legacy schemas.