Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert some panics that happen on invalid parquet files to error results #6738

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

jp0317
Copy link
Contributor

@jp0317 jp0317 commented Nov 16, 2024

Which issue does this PR close?

This solves some of #6737.

Rationale for this change

Some code changes to replace some panics with proper errors

What changes are included in this PR?

Some codes that lead to panic are converted to returning error results.

Are there any user-facing changes?

Behavior change from panics to errors.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 16, 2024
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this, this seems to add a number of untested additional checks, some to very hot codepaths.

I suggest rather than just looking for things that might panic, instead going from a failing test to a fix. This would also better capture the more problematic cases where the reader gets stuck on malformed input, a panic is a good outcome IMO...

@@ -959,17 +959,18 @@ impl ColumnChunkMetaData {
}

/// Returns the offset and length in bytes of the column chunk within the file
pub fn byte_range(&self) -> (u64, u64) {
pub fn byte_range(&self) -> Result<(u64, u64)> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking API change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah..is it fine given that it just wraps the return value within Result? The behavior change is just "panic --> error".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is a breaking change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this change. btw i saw that the planned 54.0.0 will have "major api change", how is an api change considered as acceptable breaking change??

parquet/src/format.rs Outdated Show resolved Hide resolved
@@ -1227,6 +1231,10 @@ fn from_thrift_helper(elements: &[SchemaElement], index: usize) -> Result<(usize
if !is_root_node {
builder = builder.with_repetition(rep);
}
} else if !is_root_node {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is based on the comment at line 1230 which says All other types must have one, and the assert at line 1066: assert!(tp.get_basic_info.()has_repetitio())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern with this check is unless it is necessary for correctness, there is potential adding it breaks something for someone. Parquet is a very broad ecosystem, lots of writers have interesting interpretations of the specification

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it breaks something for someone

if it happened then someone must have already experienced panics. But i get your point, removed this change. thx

@@ -67,7 +67,17 @@ impl<'a> TCompactSliceInputProtocol<'a> {
let mut shift = 0;
loop {
let byte = self.read_byte()?;
in_progress |= ((byte & 0x7F) as u64) << shift;
let val = (byte & 0x7F) as u64;
let val = val.checked_shl(shift).map_or_else(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very performance critical code path, this probably should use wrapping_shl

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm afraid wrapping might cause correctness issues...besides, would the checked_shl really make a noticeable performance difference here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't cause correctness issues, and yes it will matter. There are benchmarks that will likely show this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, changed to wrapping_shl, thx!

impl TInputProtocol for TCompactSliceInputProtocol<'_> {
fn read_message_begin(&mut self) -> thrift::Result<TMessageIdentifier> {
unimplemented!()
thrift_unimplemented!()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be unreachable, a panic is the correct thing to do here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiuc it "should be unreachable" unless the input file is malformed? I guess this goes back to the discussion on how to handle invalid inputs:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it is actually genuinely unreachable, we don't use thrift messages

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed back to unimplemented

@jp0317
Copy link
Contributor Author

jp0317 commented Nov 19, 2024

Thanks @tustvold for the review!

this seems to add a number of untested additional checks...

I think the changes are more about converting panics to errors, rather than actual code logic.

looking for things that might panic, instead going from a failing test to a fix

these panics were triggered in my own fuzzing test with invalid parquet files. Nevertheless, i think it's a similar topic of "how to handle invalid inputs" as discussed in #5323. Reading this doc, imho errors better than panics unless it's really something unrecoverable.

@jp0317
Copy link
Contributor Author

jp0317 commented Nov 20, 2024

Hi @tustvold, I removed some changes based on your comment. PTAL, thanks!

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just a few nits. However, I wonder if perhaps this should be multiple PRs with rationales given for each change.

Comment on lines +620 to +621
let column_orders =
Self::parse_column_orders(t_file_metadata.column_orders, &schema_descr)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this would currently panic, but would one ever prefer to just set column_orders to None and continue? The only impact AFAIK would be statistics being unusable, which would only matter if predicates were in use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! i agree with the setting to None idea. but i guess this worths a separate issue to discuss and fix.

if let Some(min) = min {
if min.len() < len {
return Err(ParquetError::General(
"Insufficient bytes to parse max statistic".to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Insufficient bytes to parse max statistic".to_string(),
"Insufficient bytes to parse min statistic".to_string(),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for catching this.

@jp0317
Copy link
Contributor Author

jp0317 commented Nov 21, 2024

Thanks! Just a few nits. However, I wonder if perhaps this should be multiple PRs with rationales given for each change.

Thanks for the review! imho these changes share the same rationale in that they just convert panics to errors

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks correct to me. I personally would prefer errors to panics when processing multiple files.

@jp0317
Copy link
Contributor Author

jp0317 commented Nov 26, 2024

Hi @tustvold could you please take another look at this one? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants