Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: handle signed floating point zeros in statistics #5047

Closed
Jefffrey opened this issue Nov 7, 2023 · 2 comments
Closed

Parquet: handle signed floating point zeros in statistics #5047

Jefffrey opened this issue Nov 7, 2023 · 2 comments
Labels
bug parquet Changes to the parquet crate

Comments

@Jefffrey
Copy link
Contributor

Jefffrey commented Nov 7, 2023

Describe the bug

https://github.com/apache/parquet-format/blob/46cc3a0647d301bb9579ca8dd2cc356caf2a72d2/README.md?plain=1#L162-L178

    * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
      signed zeros.   The details are documented in the
      [Thrift definition](src/main/thrift/parquet.thrift) in the
      `ColumnOrder` union. They are summarized here but the Thrift definition
      is considered authoritative:
      * NaNs should not be written to min or max statistics fields.
      * If the computed max value is zero (whether negative or positive),
        `+0.0` should be written into the max statistics field.
      * If the computed min value is zero (whether negative or positive),
        `-0.0` should be written into the min statistics field.

      For backwards compatibility when reading files:
      * If the min is a NaN, it should be ignored.
      * If the max is a NaN, it should be ignored.
      * If the min is +0, the row group may contain -0 values as well.
      * If the max is -0, the row group may contain +0 values as well.
      * When looking for NaN values, min and max should be ignored.

Specifically the points about the computed max and min values when they are negative/positive zero.

To Reproduce

Add test to parquet/src/column/writer/mod.rs:

    #[test]
    fn test_float_statistics_zero_only() {
        let stats = statistics_roundtrip::<FloatType>(&[0.0]);
        assert!(stats.has_min_max_set());
        assert!(stats.is_min_max_backwards_compatible());
        if let Statistics::Float(stats) = stats {
            assert_eq!(stats.min(), &-0.0);
            assert!(stats.min().is_sign_negative());
            assert_eq!(stats.max(), &0.0);
            assert!(stats.max().is_sign_positive());
        } else {
            panic!("expecting Statistics::Float");
        }
    }

Run:

parquet$ cargo test --lib column::writer::tests::test_float_statistics_zero_only -- --nocapture --exact
    Finished test [unoptimized + debuginfo] target(s) in 0.06s
     Running unittests src/lib.rs (/media/jeffrey/1tb_860evo_ssd/.cargo_target_cache/debug/deps/parquet-b466af98c7f74484)

running 1 test
thread 'column::writer::tests::test_float_statistics_zero_only' panicked at parquet/src/column/writer/mod.rs:2121:13:
assertion failed: stats.min().is_sign_negative()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test column::writer::tests::test_float_statistics_zero_only ... FAILED

failures:

failures:
    column::writer::tests::test_float_statistics_zero_only

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 617 filtered out; finished in 0.00s

error: test failed, to rerun pass `--lib`

Expected behavior

Test should succeed

Additional context

@Jefffrey Jefffrey added the bug label Nov 7, 2023
@Jefffrey
Copy link
Contributor Author

Jefffrey commented Nov 7, 2023

I plan to take this on, maybe after #5003 is merged so can do f16, f32, f64 all at once

@tustvold tustvold changed the title Parquet: writing zero to statistics should follow spec Parquet: handle signed floating point zeros in statistics Nov 7, 2023
@tustvold
Copy link
Contributor

tustvold commented Nov 7, 2023

Sigh... I wish parquet just used total ordering rather than this mess of special casing. If nothing else it makes actually using the statistics correctly very subtle

@tustvold tustvold closed this as completed Nov 7, 2023
@tustvold tustvold added the parquet Changes to the parquet crate label Jan 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug parquet Changes to the parquet crate
Projects
None yet
Development

No branches or pull requests

2 participants