[Spark] Add support for sorting within partitions when Z-ordering #4006

Open: maltevelin wants to merge 3 commits into base: master

maltevelin commented Dec 29, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Resolves #4000 by introducing a new configuration property, spark.databricks.io.skipping.mdc.sortWithinPartitions. When enabled, records within each Parquet file are sorted by their Z-order or Hilbert curve value, so that records that are close on the curve land in the same row group. This improves data skipping at the Parquet row-group level. Benchmarks included in the issue show speedups of approximately 8× and 11× on two different datasets. Please refer to the issue for more details.
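To illustrate why ordering rows inside a file matters, here is a minimal, self-contained simulation. It is not code from this PR: the 8-bit Morton interleave, the row-group size of 1,000 rows, and the query ranges are all assumptions chosen for illustration. It builds per-row-group min/max statistics, similar in spirit to Parquet row-group statistics, and counts how many row groups a range filter would still have to read.

```python
import random


def z_value(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return z


def row_group_stats(rows, rows_per_group):
    """Per-column (min, max) for each consecutive chunk of rows."""
    stats = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        xs = [x for x, _ in chunk]
        ys = [y for _, y in chunk]
        stats.append(((min(xs), max(xs)), (min(ys), max(ys))))
    return stats


def groups_to_read(stats, x_range, y_range):
    """Count row groups whose min/max ranges overlap both query ranges."""
    def overlaps(col_stats, query):
        return col_stats[0] <= query[1] and col_stats[1] >= query[0]

    return sum(
        1 for xs, ys in stats if overlaps(xs, x_range) and overlaps(ys, y_range)
    )


# Hypothetical data: 10,000 random (x, y) rows that all ended up in one file.
rng = random.Random(0)
rows = [(rng.randrange(256), rng.randrange(256)) for _ in range(10_000)]

# Same rows, once in arrival order and once sorted by Z-value within the file.
unsorted_stats = row_group_stats(rows, 1_000)
sorted_stats = row_group_stats(sorted(rows, key=lambda r: z_value(*r)), 1_000)

# Filter: 0 <= x <= 31 AND 0 <= y <= 31.
unsorted_reads = groups_to_read(unsorted_stats, (0, 31), (0, 31))
sorted_reads = groups_to_read(sorted_stats, (0, 31), (0, 31))
print(f"row groups read: unsorted={unsorted_reads}, sorted={sorted_reads}")
```

In this toy setup the sorted layout should need fewer row groups, because each group covers a narrower region of the (x, y) space. The benefit on real tables depends on row-group size, data distribution, and the filters used.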

How was this patch tested?

Added test cases in MultiDimClusteringSuite.scala for Hilbert and Z-order curves.

Does this PR introduce any user-facing changes?

Yes. This PR introduces a new configuration property, spark.databricks.io.skipping.mdc.sortWithinPartitions. It defaults to false, so existing users are unaffected unless they opt in by setting it to true.
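As a rough sketch of how a user might opt in from PySpark, assuming this PR is merged and the property keeps the name above: the helper below sets the flag on the session and then issues a Delta OPTIMIZE ... ZORDER BY statement. The function name, the table name, and the column names are hypothetical; `spark` is expected to be an existing SparkSession with Delta Lake configured.

```python
SORT_WITHIN_PARTITIONS_KEY = "spark.databricks.io.skipping.mdc.sortWithinPartitions"


def zorder_with_sorted_partitions(spark, table_name, columns):
    """Enable the opt-in flag for this session, then run OPTIMIZE ... ZORDER BY.

    `spark` is assumed to behave like a pyspark.sql.SparkSession
    (it only needs `conf.set` and `sql`).
    """
    # The property defaults to false; setting it to true opts in.
    spark.conf.set(SORT_WITHIN_PARTITIONS_KEY, "true")
    statement = f"OPTIMIZE {table_name} ZORDER BY ({', '.join(columns)})"
    return spark.sql(statement)
```

For example, `zorder_with_sorted_partitions(spark, "events", ["user_id", "event_time"])` would run the statement against a hypothetical `events` table.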

Previous Behavior
Z-ordering range-partitioned the data with repartitionByRange but did not sort it within partitions.

New Behavior
When the property is enabled, sortWithinPartitions is applied after repartitionByRange in MultiDimClustering.scala.
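The order of operations described here can be sketched with the PySpark DataFrame API, which exposes the same two methods. This is an illustrative stand-in, not the Scala code in MultiDimClustering.scala: the function name, its parameters, and the idea of passing the flag as a plain boolean are assumptions.

```python
def cluster_by_curve(df, num_partitions, curve_column, sort_within_partitions=False):
    """Range-partition on the curve value; optionally sort each partition on it.

    `df` is assumed to behave like a pyspark.sql.DataFrame, and
    `curve_column` names a column holding the Z-order or Hilbert value.
    """
    # Step 1 (existing behavior): rows with nearby curve values share a partition.
    clustered = df.repartitionByRange(num_partitions, curve_column)
    # Step 2 (new, opt-in): order rows inside each partition by the curve value,
    # so the rows written to each file are sorted as well.
    if sort_within_partitions:
        clustered = clustered.sortWithinPartitions(curve_column)
    return clustered
```

With the flag left at its default of false, the function returns the range-partitioned DataFrame unchanged, which mirrors the previous behavior.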

Successfully merging this pull request may close these issues.

[Feature Request] [Spark] Optionally sort within partitions when Z-ordering