-
Notifications
You must be signed in to change notification settings - Fork 804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert some panics that happen on invalid parquet files to error results #6738
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this, this seems to add a number of untested additional checks, some to very hot codepaths.
I suggest rather than just looking for things that might panic, instead going from a failing test to a fix. This would also better capture the more problematic cases where the reader gets stuck on malformed input, a panic is a good outcome IMO...
parquet/src/file/metadata/mod.rs
Outdated
@@ -959,17 +959,18 @@ impl ColumnChunkMetaData { | |||
} | |||
|
|||
/// Returns the offset and length in bytes of the column chunk within the file | |||
pub fn byte_range(&self) -> (u64, u64) { | |||
pub fn byte_range(&self) -> Result<(u64, u64)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking API change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah..is it fine given that it just wraps the return value within Result? The behavior change is just "panic --> error".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is a breaking change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed this change. btw i saw that the planned 54.0.0 will have "major api change", how is an api change considered as acceptable breaking change??
parquet/src/schema/types.rs
Outdated
@@ -1227,6 +1231,10 @@ fn from_thrift_helper(elements: &[SchemaElement], index: usize) -> Result<(usize | |||
if !is_root_node { | |||
builder = builder.with_repetition(rep); | |||
} | |||
} else if !is_root_node { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is based on the comment at line 1230 which says All other types must have one
, and the assert at line 1066: assert!(tp.get_basic_info.()has_repetitio())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern with this check is unless it is necessary for correctness, there is potential adding it breaks something for someone. Parquet is a very broad ecosystem, lots of writers have interesting interpretations of the specification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it breaks something for someone
if it happened then someone must have already experienced panics. But i get your point, removed this change. thx
parquet/src/thrift.rs
Outdated
@@ -67,7 +67,17 @@ impl<'a> TCompactSliceInputProtocol<'a> { | |||
let mut shift = 0; | |||
loop { | |||
let byte = self.read_byte()?; | |||
in_progress |= ((byte & 0x7F) as u64) << shift; | |||
let val = (byte & 0x7F) as u64; | |||
let val = val.checked_shl(shift).map_or_else( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very performance critical code path, this probably should use wrapping_shl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm afraid wrapping might cause correctness issues...besides, would the checked_shl really make a noticeable performance difference here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't cause correctness issues, and yes it will matter. There are benchmarks that will likely show this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, changed to wrapping_shl, thx!
parquet/src/thrift.rs
Outdated
impl TInputProtocol for TCompactSliceInputProtocol<'_> { | ||
fn read_message_begin(&mut self) -> thrift::Result<TMessageIdentifier> { | ||
unimplemented!() | ||
thrift_unimplemented!() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be unreachable, a panic is the correct thing to do here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iiuc it "should be unreachable" unless the input file is malformed? I guess this goes back to the discussion on how to handle invalid inputs:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it is actually genuinely unreachable, we don't use thrift messages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed back to unimplemented
Thanks @tustvold for the review!
I think the changes are more about converting panics to errors, rather than actual code logic.
these panics were triggered in my own fuzzing test with invalid parquet files. Nevertheless, i think it's a similar topic of "how to handle invalid inputs" as discussed in #5323. Reading this doc, imho errors better than panics unless it's really something unrecoverable. |
Hi @tustvold, I removed some changes based on your comment. PTAL, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Just a few nits. However, I wonder if perhaps this should be multiple PRs with rationales given for each change.
let column_orders = | ||
Self::parse_column_orders(t_file_metadata.column_orders, &schema_descr)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this would currently panic, but would one ever prefer to just set column_orders
to None
and continue? The only impact AFAIK would be statistics being unusable, which would only matter if predicates were in use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point! i agree with the setting to None
idea. but i guess this worths a separate issue to discuss and fix.
parquet/src/file/statistics.rs
Outdated
if let Some(min) = min { | ||
if min.len() < len { | ||
return Err(ParquetError::General( | ||
"Insufficient bytes to parse max statistic".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Insufficient bytes to parse max statistic".to_string(), | |
"Insufficient bytes to parse min statistic".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for catching this.
Thanks for the review! imho these changes share the same rationale in that they just convert panics to errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks correct to me. I personally would prefer errors to panics when processing multiple files.
Hi @tustvold could you please take another look at this one? Thanks! |
Which issue does this PR close?
This solves some of #6737.
Rationale for this change
Some code changes to replace some panics with proper errors
What changes are included in this PR?
Some codes that lead to panic are converted to returning error results.
Are there any user-facing changes?
Behavior change from panics to errors.