[Spark] Add support for sorting within partitions when Z-ordering #4006
Which Delta project/connector is this regarding?
Spark
Description
Resolves #4000 by introducing a new configuration property, spark.databricks.io.skipping.mdc.sortWithinPartitions, which clusters records within Parquet files at the row-group level based on their Z-order or Hilbert curve values. This improves data skipping at the Parquet level. Benchmarks included in the issue demonstrate speedups of approximately 8× and 11× on two different datasets. Please refer to the issue for more details.
How was this patch tested?
Added test cases in MultiDimClusteringSuite.scala for Hilbert and Z-order curves.
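For flavor, a rough sketch of what such a test case could look like; the helpers used here (withSQLConf and the MultiDimClustering.cluster call shape) are assumptions about the existing test scaffolding, not taken verbatim from the patch:

```scala
import org.apache.spark.sql.functions.col

test("Z-order clustering with sortWithinPartitions enabled") {
  // Assumed scaffolding: withSQLConf from the shared Spark test utilities and
  // a cluster() helper on MultiDimClustering; the real suite may differ.
  withSQLConf("spark.databricks.io.skipping.mdc.sortWithinPartitions" -> "true") {
    val df = spark.range(0, 1000)
      .withColumn("c1", col("id") % 97)
      .withColumn("c2", col("id") % 89)

    val clustered = MultiDimClustering.cluster(
      df, approxNumPartitions = 4, colNames = Seq("c1", "c2"), curve = "zorder")

    // With the flag on, rows inside each of the 4 range partitions should
    // additionally be ordered by their Z-order curve value.
    assert(clustered.rdd.getNumPartitions === 4)
  }
}
```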
Does this PR introduce any user-facing changes?
Yes. This PR introduces a new configuration property, spark.databricks.io.skipping.mdc.sortWithinPartitions. The property defaults to false, so existing users remain unaffected unless they opt in by setting it to true.
Previous Behavior
Z-ordering did not sort data within partitions.
New Behavior
When the property is enabled, sortWithinPartitions is applied after repartitionByRange in MultiDimClustering.scala.
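A minimal sketch of the shape of that change, assuming a computed curve-value column and a boolean read from the new property; the identifiers below are illustrative and do not mirror the exact code in MultiDimClustering.scala:

```scala
import org.apache.spark.sql.{Column, DataFrame}

// Sketch only: `curveCol` stands for the computed Z-order/Hilbert value column
// and `sortWithinPartitionsEnabled` for the value of the new property.
def clusterSketch(
    df: DataFrame,
    approxNumPartitions: Int,
    curveCol: Column,
    sortWithinPartitionsEnabled: Boolean): DataFrame = {
  // Existing behavior: range partitioning by curve value groups nearby curve
  // values into the same output file.
  val repartitioned = df.repartitionByRange(approxNumPartitions, curveCol)

  // New behavior: when the property is on, also sort rows inside each
  // partition by curve value so that Parquet row groups get tight min/max
  // statistics on the clustering columns.
  if (sortWithinPartitionsEnabled) repartitioned.sortWithinPartitions(curveCol)
  else repartitioned
}
```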