Parquet: read/write f16 for Arrow #5003
Conversation
I hope I've covered all the necessary areas for just reading/writing Arrow.
Won't include changes for the ColumnReader API files in the interest of scope/PR size (still WIP anyway).
```diff
@@ -243,6 +243,8 @@ pub fn to_thrift(stats: Option<&Statistics>) -> Option<TStatistics> {
     distinct_count: stats.distinct_count().map(|value| value as i64),
     max_value: None,
     min_value: None,
+    is_max_value_exact: None,
```
Due to apache/parquet-format@31f92c7.
Unsure if there is more work required to support this; that might require a separate issue?
Seems it might be covered by/related to #5037.
I think before we merge this we should get a file containing this type added to parquet-testing, so that we can ensure interoperability with other implementations. Otherwise this looks reasonable to me
Co-authored-by: Raphael Taylor-Davies <[email protected]>
There does seem to be this PR for parquet-testing: apache/parquet-testing#40
Just realized I also need to fix the statistics handling, otherwise I think it might write incorrect stats for files.
PR scope extended to include full support for f16 (including in the ColumnReader API).
Also reverted formatting changes so as not to pollute the PR.
```rust
fn is_nan<T: ParquetValueType>(descr: &ColumnDescriptor, val: &T) -> bool {
    match T::PHYSICAL_TYPE {
        Type::FLOAT | Type::DOUBLE => val != val,
        Type::FIXED_LEN_BYTE_ARRAY if descr.logical_type() == Some(LogicalType::Float16) => {
            let val = val.as_bytes();
            let val = f16::from_le_bytes([val[0], val[1]]);
            val.is_nan()
        }
        // remaining physical types cannot represent NaN
        _ => false,
    }
}
```
Purely from type `T` we can't determine whether the FixedLenByteArray represents a Float16 logical type, hence the need for the `ColumnDescriptor` param to provide this information so we can subsequently check NaN for f16.
```rust
if let Some(LogicalType::Float16) = descr.logical_type() {
    let a = a.as_bytes();
    let a = f16::from_le_bytes([a[0], a[1]]);
    let b = b.as_bytes();
    let b = f16::from_le_bytes([b[0], b[1]]);
    return a > b;
}
```
Don't compare as bytes, compare as f16s.
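Here's a minimal sketch of the failure the comparator fix avoids (illustrative only, not from the PR; assumes the `half` crate, which this PR uses for `f16`): an unsigned byte-wise comparison of the little-endian encoding misorders signed values, because the sign bit sits in the most significant byte.

```rust
use half::f16;

fn main() {
    let a = f16::from_f32(-1.0); // bits 0xBC00, little-endian bytes [0x00, 0xBC]
    let b = f16::from_f32(1.0);  // bits 0x3C00, little-endian bytes [0x00, 0x3C]
    // Byte-wise, -1.0 ranks above 1.0: the sign bit inflates the high byte...
    assert!(a.to_le_bytes()[1] > b.to_le_bytes()[1]);
    // ...whereas comparing as actual f16 values gives the correct order.
    assert!(a < b);
}
```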
```rust
#[test]
fn test_column_writer_check_float16_nan_middle() {
    let input = [f16::ONE, f16::NAN, f16::ONE + f16::ONE]
        .into_iter()
        .map(|s| ByteArray::from(s).into())
        .collect::<Vec<_>>();

    let stats = float16_statistics_roundtrip(&input);
    assert!(stats.has_min_max_set());
    assert!(stats.is_min_max_backwards_compatible());
    assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
    assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
}

#[test]
fn test_float16_statistics_nan_middle() {
    let input = [f16::ONE, f16::NAN, f16::ONE + f16::ONE]
        .into_iter()
        .map(|s| ByteArray::from(s).into())
        .collect::<Vec<_>>();

    let stats = float16_statistics_roundtrip(&input);
    assert!(stats.has_min_max_set());
    assert!(stats.is_min_max_backwards_compatible());
    assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
    assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
}

#[test]
fn test_float16_statistics_nan_start() {
    let input = [f16::NAN, f16::ONE, f16::ONE + f16::ONE]
        .into_iter()
        .map(|s| ByteArray::from(s).into())
        .collect::<Vec<_>>();

    let stats = float16_statistics_roundtrip(&input);
    assert!(stats.has_min_max_set());
    assert!(stats.is_min_max_backwards_compatible());
    assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
    assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
}

#[test]
fn test_float16_statistics_nan_only() {
    let input = [f16::NAN, f16::NAN]
        .into_iter()
        .map(|s| ByteArray::from(s).into())
        .collect::<Vec<_>>();

    let stats = float16_statistics_roundtrip(&input);
    assert!(!stats.has_min_max_set());
    assert!(stats.is_min_max_backwards_compatible());
}
```
Ensuring NaNs are handled correctly in statistics (i.e. they are ignored).
Thank you for this, in the interests of saving time I intend to review this after apache/parquet-testing#40 has merged.
@benibus Do you feel like taking a look at this PR?
Thanks for this! I'm not familiar with this codebase and I have very limited Rust knowledge... but on a high level, this seems pretty good to me.
```diff
@@ -1170,6 +1188,7 @@ fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
 mod tests {
     use crate::{file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH, format::BoundaryOrder};
```
Are there tests for ensuring that `BoundaryOrder` is deduced correctly (i.e. not like a normal fixed binary)? I'd imagine that would be automatically handled by the Float16-specific comparators included here?
Ah, I don't think I've considered that (nor truncation of statistics). Thanks for pointing that out, I'll check what changes will need to be made for those
Actually, in regards to truncation: is it intended to disallow truncation of f16 stats, or should it take place anyway?
Nowhere in the new spec for f16 does it mention this case, and neither did I see relevant changes in apache/arrow#36073, but perhaps I missed it?
e.g. if the user set the column index truncation length to 1 byte, should we still truncate f16 to one byte since its underlying representation is a fixed-len byte array, or should we leave it as 2 bytes, since truncating f16 doesn't make sense given it doesn't follow the sort order for fixed-len byte arrays?
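For illustration, a small sketch (assuming the `half` crate; not from the PR) of why byte truncation is meaningless for Float16: the value is stored little-endian, so a 1-byte prefix keeps only the low mantissa bits and drops the sign and exponent entirely.

```rust
use half::f16;

fn main() {
    let min = f16::from_f32(1.0); // bits 0x3C00, little-endian bytes [0x00, 0x3C]
    let max = f16::from_f32(2.0); // bits 0x4000, little-endian bytes [0x00, 0x40]
    // Truncating both stats to 1 byte keeps only the low mantissa byte,
    // collapsing distinct min/max values into the same useless prefix.
    assert_eq!(min.to_le_bytes()[0], max.to_le_bytes()[0]);
}
```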
Probably better to ignore the truncation limit in this case, IMO.
Ok, so we will need to insert a special case for Float16 to ensure its stats won't get truncated.

Also regarding `BoundaryOrder`: it seems it isn't being derived in general yet, for any other types either (arrow-rs/parquet/src/file/metadata.rs, lines 888 to 889 in 31b5724):

```rust
// TODO: calc the order for all pages in this column
boundary_order: BoundaryOrder,
```

I couldn't find where it might be set, so this could be handled as a separate issue.
This looks good and well tested, thank you 👍
I'm going to merge this as I think it moves things forward. Any follow-up from #5003 (comment) I think can safely be handled in a subsequent PR.
…arquet-testing (#38753)

### Rationale for this change

Validates compatibility between implementations when reading `Float16` columns.

### What changes are included in this PR?

- Bumps `parquet-testing` commit to latest to use the recently-added files
- Adds reader tests for C++ and Go in the same vein as apache/arrow-rs#5003

### Are these changes tested?

Yes

### Are there any user-facing changes?

No

* Closes: #38751

Authored-by: benibus <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
### Which issue does this PR close?

Closes #4986.

### Rationale for this change

### What changes are included in this PR?

- Allow reading and writing the f16 type between Parquet and Arrow RecordBatches (see the round-trip sketch below)
- Support f16 in the ColumnReader API as well
- Handle writing statistics properly
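As a rough usage sketch of what this enables (not taken from the PR itself; assumes the `arrow`, `parquet`, `bytes`, and `half` crates at versions that include this change):

```rust
use std::sync::Arc;

use arrow_array::{Float16Array, RecordBatch};
use bytes::Bytes;
use half::f16;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a record batch with a single Float16 column.
    let array = Float16Array::from(vec![f16::from_f32(1.0), f16::from_f32(2.5)]);
    let batch = RecordBatch::try_from_iter([("col", Arc::new(array) as _)])?;

    // Write the f16 column to an in-memory Parquet file...
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;

    // ...and read it back as Arrow record batches.
    let reader = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf))?.build()?;
    for batch in reader {
        println!("{:?}", batch?);
    }
    Ok(())
}
```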
### Are there any user-facing changes?