[Spark] Add support for sorting within partitions when Z-ordering #4006

maltevelin · 2024-12-29T22:38:17Z

Which Delta project/connector is this regarding?

Description

Resolves #4000 by introducing a new configuration property, spark.databricks.io.skipping.mdc.sortWithinPartitions. When enabled, records are sorted by their Z-order or Hilbert curve value within each partition, so that records with nearby curve values end up in the same row groups inside each Parquet file. This improves data skipping at the Parquet row-group level. Benchmarks included in the issue show speedups of approximately 8× and 11× on two different datasets. Please refer to the issue for more details.
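
To illustrate why ordering by curve value helps at the row-group level, here is a minimal pure-Python sketch. It is not Delta or Spark code. It assumes a two-column table of small non-negative integers, a simple bit-interleaving Z-order function, fixed-size row groups, and per-row-group min/max statistics on each column, similar in spirit to what Parquet stores. The names z_value, row_groups and groups_to_read are hypothetical and exist only for this example.

```python
import random

def z_value(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def row_groups(rows, size):
    """Split a file's rows into consecutive row groups of `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def groups_to_read(groups, x, y):
    """Count row groups whose min/max stats on both columns admit the point (x, y)."""
    hits = 0
    for g in groups:
        xs = [r[0] for r in g]
        ys = [r[1] for r in g]
        if min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys):
            hits += 1
    return hits

# One illustrative "file": every (x, y) on a 16 x 16 grid, in arbitrary order.
rng = random.Random(0)
rows = [(x, y) for x in range(16) for y in range(16)]
rng.shuffle(rows)

# Same rows, same row-group size; only the order inside the file differs.
unsorted_groups = row_groups(rows, 16)
sorted_groups = row_groups(sorted(rows, key=lambda r: z_value(*r)), 16)

unsorted_hits = groups_to_read(unsorted_groups, 5, 9)
sorted_hits = groups_to_read(sorted_groups, 5, 9)
print("row groups to read, unsorted:", unsorted_hits)
print("row groups to read, sorted by Z-value:", sorted_hits)
```

In this toy setup each sorted row group covers a compact 4 x 4 block of the grid, so a point lookup only has to read the single row group containing that point. Real tables and real Parquet writers will behave less neatly; the benchmarks in the issue are the authoritative numbers.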

How was this patch tested?

Added test cases in MultiDimClusteringSuite.scala for Hilbert and Z-order curves.

Does this PR introduce any user-facing changes?

Yes. This PR introduces a new configuration property, spark.databricks.io.skipping.mdc.sortWithinPartitions. The property defaults to false, so existing users are unaffected unless they opt in by setting it to true.

Previous Behavior
Z-ordering did not sort data within partitions.

New Behavior
When the property is enabled, sortWithinPartitions is applied after repartitionByRange in MultiDimClustering.scala.
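
The two steps can be modelled in plain Python as a rough sketch. This is not the code in MultiDimClustering.scala: range_partition and sort_within are hypothetical stand-ins for Spark's repartitionByRange and sortWithinPartitions, the curve values are plain integers, and the range boundaries are computed exactly here, whereas Spark estimates them by sampling.

```python
import random

def range_partition(values, num_partitions):
    """Stand-in for repartitionByRange: route each value to a partition by range,
    keeping rows inside a partition in arrival order (no sorting)."""
    ordered = sorted(values)
    step = len(ordered) / num_partitions
    # Inclusive upper bound of every partition except the last one.
    bounds = [ordered[int(step * (i + 1)) - 1] for i in range(num_partitions - 1)]
    parts = [[] for _ in range(num_partitions)]
    for v in values:
        idx = sum(1 for b in bounds if v > b)  # number of bounds below v
        parts[idx].append(v)
    return parts

def sort_within(parts):
    """Stand-in for sortWithinPartitions: sort each partition independently."""
    return [sorted(p) for p in parts]

rng = random.Random(1)
curve_values = rng.sample(range(1000), 40)  # 40 distinct illustrative curve values

before = range_partition(curve_values, 4)   # previous behavior
after = sort_within(before)                 # behavior with the property enabled

for i, (b, a) in enumerate(zip(before, after)):
    print(f"partition {i}: before={b[:4]}... after={a[:4]}...")
```

Range partitioning alone already keeps the partitions' value ranges disjoint; the extra sort only changes the order of rows inside each partition, which is what determines how rows are grouped into row groups when the file is written.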

Signed-off-by: Malte Velin <[email protected]>

maltevelin added 3 commits December 28, 2024 20:10

Add configuration property to toggle sorting output on Z-order value.

59b0449

Signed-off-by: Malte Velin <[email protected]>

If configuration property is set to true then sort output on Z-order …

3644342

…value. Signed-off-by: Malte Velin <[email protected]>

Add unit tests ensuring that records in each partition are sorted acc…

23881d3

…ording to curve. Signed-off-by: Malte Velin <[email protected]>

maltevelin mentioned this pull request Dec 29, 2024

[Feature Request] [Spark] Optionally sort within partitions when Z-ordering #4000

Open

8 tasks
