Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: read/write f16 for Arrow #5003

Merged
merged 15 commits into from
Nov 13, 2023
Merged

Conversation

Jefffrey
Copy link
Contributor

@Jefffrey Jefffrey commented Oct 29, 2023

Which issue does this PR close?

Closes #4986

Rationale for this change

What changes are included in this PR?

Allow reading and writing f16 type from parquet to/from Arrow recordbatches

Also support in ColumnReader API

And handle writing statistics properly

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 29, 2023
Copy link
Contributor Author

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope I've covered all the necessary areas for just reading/writing arrow

Won't include changes for the ColumnReader API files in interest of scope/PR size (still WIP anyway)

@@ -243,6 +243,8 @@ pub fn to_thrift(stats: Option<&Statistics>) -> Option<TStatistics> {
distinct_count: stats.distinct_count().map(|value| value as i64),
max_value: None,
min_value: None,
is_max_value_exact: None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to apache/parquet-format@31f92c7

Unsure if there is more work required to support this, that might require a separate issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems might be covered by/related to #5037

@Jefffrey Jefffrey marked this pull request as ready for review October 29, 2023 10:58
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think before we merge this we should get a file containing this type added to parquet-testing, so that we can ensure interoperability with other implementations. Otherwise this looks reasonable to me

parquet/src/arrow/arrow_writer/mod.rs Outdated Show resolved Hide resolved
parquet/src/arrow/arrow_reader/mod.rs Outdated Show resolved Hide resolved
@tustvold tustvold added the api-change Changes to the arrow API label Oct 30, 2023
@Jefffrey
Copy link
Contributor Author

I think before we merge this we should get a file containing this type added to parquet-testing, so that we can ensure interoperability with other implementations. Otherwise this looks reasonable to me

There does seem to be this PR for parquet-testing: apache/parquet-testing#40

@Jefffrey Jefffrey marked this pull request as draft November 7, 2023 02:58
@Jefffrey
Copy link
Contributor Author

Jefffrey commented Nov 7, 2023

Just realized I need to also fix the statistics handling as well, otherwise I think it might write incorrect stats for files

Copy link
Contributor Author

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR scope extended to include all support for f16 (including in ColumnReader API)

Also reverted formatting changes to not pollute the PR

Comment on lines +981 to +988
fn is_nan<T: ParquetValueType>(descr: &ColumnDescriptor, val: &T) -> bool {
match T::PHYSICAL_TYPE {
Type::FLOAT | Type::DOUBLE => val != val,
Type::FIXED_LEN_BYTE_ARRAY if descr.logical_type() == Some(LogicalType::Float16) => {
let val = val.as_bytes();
let val = f16::from_le_bytes([val[0], val[1]]);
val.is_nan()
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Purely from type T we can't determine if the FixedLenByteArray represents a Float16 logical type, hence need ColumnDescriptor param to provide this information, to subsequently check NaN for f16

Comment on lines +1052 to +1058
if let Some(LogicalType::Float16) = descr.logical_type() {
let a = a.as_bytes();
let a = f16::from_le_bytes([a[0], a[1]]);
let b = b.as_bytes();
let b = f16::from_le_bytes([b[0], b[1]]);
return a > b;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't compare as bytes, compare as f16's

Comment on lines +2119 to +2171
#[test]
fn test_column_writer_check_float16_nan_middle() {
let input = [f16::ONE, f16::NAN, f16::ONE + f16::ONE]
.into_iter()
.map(|s| ByteArray::from(s).into())
.collect::<Vec<_>>();

let stats = float16_statistics_roundtrip(&input);
assert!(stats.has_min_max_set());
assert!(stats.is_min_max_backwards_compatible());
assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
}

#[test]
fn test_float16_statistics_nan_middle() {
let input = [f16::ONE, f16::NAN, f16::ONE + f16::ONE]
.into_iter()
.map(|s| ByteArray::from(s).into())
.collect::<Vec<_>>();

let stats = float16_statistics_roundtrip(&input);
assert!(stats.has_min_max_set());
assert!(stats.is_min_max_backwards_compatible());
assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
}

#[test]
fn test_float16_statistics_nan_start() {
let input = [f16::NAN, f16::ONE, f16::ONE + f16::ONE]
.into_iter()
.map(|s| ByteArray::from(s).into())
.collect::<Vec<_>>();

let stats = float16_statistics_roundtrip(&input);
assert!(stats.has_min_max_set());
assert!(stats.is_min_max_backwards_compatible());
assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
}

#[test]
fn test_float16_statistics_nan_only() {
let input = [f16::NAN, f16::NAN]
.into_iter()
.map(|s| ByteArray::from(s).into())
.collect::<Vec<_>>();

let stats = float16_statistics_roundtrip(&input);
assert!(!stats.has_min_max_set());
assert!(stats.is_min_max_backwards_compatible());
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensuring NaN's are handled correctly in statistics (i.e. are ignored)

@tustvold
Copy link
Contributor

tustvold commented Nov 7, 2023

Thank you for this, in the interests of saving time I intend to review this after apache/parquet-testing#40 has merged.

@pitrou
Copy link
Member

pitrou commented Nov 9, 2023

@benibus Do you feel like taking a look at this PR?

Copy link

@benibus benibus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this!. I'm not familiar with this codebase and I have very limited Rust knowledge... but on a high level, this seems pretty good to me.

@@ -1170,6 +1188,7 @@ fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
mod tests {
use crate::{file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH, format::BoundaryOrder};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there tests for ensuring that BoundaryOrder is deduced correctly (i.e. not like a normal fixed binary)? I'd imagine that would be automatically handled by the Float16-specific comparators included here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I don't think I've considered that (nor truncation of statistics). Thanks for pointing that out, I'll check what changes will need to be made for those

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually in regards to truncation, is it intended to disallow truncation of f16 stats or it should take place anyway?

Nowhere in the new spec for f16 does it mention this case, and neither did I see relevant changes in apache/arrow#36073, but perhaps I missed it?

e.g. if user set the truncation for column index length to 1 byte, should we still truncate f16 to one byte since its underlying representation if Fixed len byte array, or should leave it as 2 bytes as truncating f16 doesn't make sense since it doesn't follow the sort order for fixed len byte arrays?

@benibus

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to ignore the truncation limit in this case, IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so will need to insert special case for Float16 to ensure its stats won't get truncated

Also regarding the BoundaryOrder it seems it isn't being derived in general yet, for any other types:

// TODO: calc the order for all pages in this column
boundary_order: BoundaryOrder,

I couldn't find where it might be set, so this could be a separate issue to be done.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good and well tested, thank you 👍

@tustvold
Copy link
Contributor

I'm going to merge this as I think it moves things forward. Any follow up from #5003 (comment) I think can safely be handled in a subsequent PR

@tustvold tustvold merged commit 7ba36b0 into apache:master Nov 13, 2023
16 checks passed
@Jefffrey Jefffrey deleted the parquet_f16_arrow branch November 14, 2023 08:46
@Jefffrey
Copy link
Contributor Author

Jefffrey commented Nov 14, 2023

I've raised #5075 and #5074 to follow up on the discussion points

zeroshade pushed a commit to apache/arrow that referenced this pull request Nov 17, 2023
…arquet-testing (#38753)

### Rationale for this change

Validates compatibility between implementations when reading `Float16` columns.

### What changes are included in this PR?

- Bumps `parquet-testing` commit to latest to use the recently-added files
- Adds reader tests for C++ and Go in the same vein as apache/arrow-rs#5003

### Are these changes tested?

Yes

### Are there any user-facing changes?

No

* Closes: #38751

Authored-by: benibus <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…s in parquet-testing (apache#38753)

### Rationale for this change

Validates compatibility between implementations when reading `Float16` columns.

### What changes are included in this PR?

- Bumps `parquet-testing` commit to latest to use the recently-added files
- Adds reader tests for C++ and Go in the same vein as apache/arrow-rs#5003

### Are these changes tested?

Yes

### Are there any user-facing changes?

No

* Closes: apache#38751

Authored-by: benibus <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
kou pushed a commit to apache/arrow-go that referenced this pull request Aug 30, 2024
…arquet-testing (#38753)

### Rationale for this change

Validates compatibility between implementations when reading `Float16` columns.

### What changes are included in this PR?

- Bumps `parquet-testing` commit to latest to use the recently-added files
- Adds reader tests for C++ and Go in the same vein as apache/arrow-rs#5003

### Are these changes tested?

Yes

### Are there any user-facing changes?

No

* Closes: #38751

Authored-by: benibus <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Float16/Half-float logical type to Parquet
4 participants