Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update parquet encoding docs #5053

Merged
merged 2 commits into from
Nov 7, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Nov 7, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 7, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- thank you @tustvold

@@ -55,7 +55,7 @@ The `parquet` crate provides the following features which may be enabled in your

## Parquet Feature Status

- [x] All encodings supported
- [x] All encodings except BYTE_STREAM_SPLIT supported
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can leave a link to #4102 to see if we can find someone to help

/// Not all encodings are valid for all types. These enums are also used to specify the
/// encoding of definition and repetition levels.
///
/// Additionally, not all encodings are well supported by the broader ecosystem.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Additionally, not all encodings are well supported by the broader ecosystem.
/// Additionally, not all encodings are well supported by the broader Parquet ecosystem, such as parquet-mr.

Should we call out parquet-mr? In general it would be great to add some specificity if possible (aka what does "broad" mean, can we give examples of widely used systems that only practically support PLAIN / RLE / RLE_DICTIONARY?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is mainly newer systems like DuckDB that have poor support for these encodings, I have opted to reword this slightly to emphasise that the defaults are supported by everything, as opposed to casting aspersions on the other encodings.

@@ -304,6 +316,18 @@ impl FromStr for Encoding {
// Mirrors `parquet::CompressionCodec`

/// Supported compression algorithms.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Supported compression algorithms.
/// Supported block compression algorithms. Defaults to ``Self::UNCOMPRESSED``

If we are changing the docs anyway, maybe we can also call out that the default compression is NONE as I at least have been surprised by that in the past

@tustvold tustvold merged commit 747dcbf into apache:master Nov 7, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants