
Adds chunk key encoding to kwargs passed to zarr #10274


Merged
merged 15 commits on Jun 10, 2025
47 changes: 47 additions & 0 deletions doc/internals/zarr-encoding-spec.rst
@@ -77,3 +77,50 @@ re-open it directly with Zarr:
import shutil

shutil.rmtree("rasm.zarr")

Chunk Key Encoding
------------------

When writing data to Zarr stores, Xarray supports customizing how chunk keys are encoded
through the ``chunk_key_encoding`` parameter in the variable's encoding dictionary. This
is particularly useful for Zarr V2 arrays, where you may need to control the dimension
separator used in chunk keys.

For example, to specify a custom separator for chunk keys:

.. jupyter-execute::

import xarray as xr
import numpy as np
from zarr.core.chunk_key_encodings import V2ChunkKeyEncoding

# Create a custom chunk key encoding with "/" as separator
enc = V2ChunkKeyEncoding(separator="/").to_dict()

# Create and write a dataset with custom chunk key encoding
arr = np.ones((42, 100))
ds = xr.DataArray(arr, name="var1").to_dataset()
ds.to_zarr(
"example.zarr",
zarr_format=2,
mode="w",
encoding={"var1": {"chunks": (42, 50), "chunk_key_encoding": enc}},
)

The ``chunk_key_encoding`` option accepts a dictionary that specifies the encoding
configuration. For Zarr V2 arrays, you can use the ``V2ChunkKeyEncoding`` class from
``zarr.core.chunk_key_encodings`` to generate this configuration. This helps ensure
compatibility with specific Zarr V2 storage layouts and with tools that expect a
particular chunk key format.
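
If you prefer not to import from ``zarr.core``, the same configuration can be written
as a plain dictionary. The snippet below is a minimal sketch; the dictionary shown is
what ``V2ChunkKeyEncoding(separator="/").to_dict()`` produces.

.. jupyter-execute::

# Equivalent to V2ChunkKeyEncoding(separator="/").to_dict()
enc_dict = {"name": "v2", "configuration": {"separator": "/"}}
enc_dict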

.. note::
The ``chunk_key_encoding`` option is only relevant when writing to Zarr stores.
When reading Zarr arrays, Xarray automatically detects and uses the appropriate
chunk key encoding based on the store's format and configuration.
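
As a quick round-trip check, the store written above can be re-opened directly with
:py:func:`~xarray.open_zarr`; a minimal sketch, assuming the ``example.zarr`` store
from the earlier example still exists:

.. jupyter-execute::

# The "/" separator is picked up automatically when reading
ds2 = xr.open_zarr("example.zarr")
ds2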

.. jupyter-execute::
:hide-code:

import shutil

shutil.rmtree("example.zarr")
2 changes: 2 additions & 0 deletions doc/whats-new.rst
@@ -164,6 +164,8 @@ Bug fixes
By `Mathias Hauser <https://github.com/mathause>`_.
- Variables with no temporal dimension are left untouched by :py:meth:`~xarray.Dataset.convert_calendar`. (:issue:`10266`, :pull:`10268`)
By `Pascal Bourgault <https://github.com/aulemahal>`_.
- Enable ``chunk_key_encoding`` in :py:meth:`~xarray.Dataset.to_zarr` for Zarr v2 Datasets (:pull:`10274`)
By `BrianMichell <https://github.com/BrianMichell>`_.

Documentation
~~~~~~~~~~~~~
1 change: 1 addition & 0 deletions xarray/backends/zarr.py
@@ -395,6 +395,7 @@ def extract_zarr_variable_encoding(
"serializer",
"cache_metadata",
"write_empty_chunks",
"chunk_key_encoding",
}
if zarr_format == 3:
valid_encodings.add("fill_value")
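
For context, the backend change above adds ``chunk_key_encoding`` to the set of keys that
``extract_zarr_variable_encoding`` treats as valid Zarr encoding options. A rough,
illustrative sketch of how a valid-keys set like this is typically applied to a variable's
encoding dictionary (the function name and exact behavior here are assumptions, not
xarray's internals):

# Illustrative only: reject unrecognized encoding keys, pass through the rest.
def filter_encoding(encoding: dict, valid_encodings: set) -> dict:
    invalid = set(encoding) - valid_encodings
    if invalid:
        raise ValueError(f"unexpected encoding parameters: {invalid}")
    return dict(encoding)

valid = {"chunks", "compressors", "filters", "chunk_key_encoding"}
enc = {
    "chunks": (42, 50),
    "chunk_key_encoding": {"name": "v2", "configuration": {"separator": "/"}},
}
filter_encoding(enc, valid)  # accepted now that "chunk_key_encoding" is allowed
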
33 changes: 33 additions & 0 deletions xarray/tests/test_backends.py
@@ -2274,7 +2274,7 @@
# Flaky test. Very open to contributions on fixing this
@pytest.mark.flaky
def test_roundtrip_coordinates(self) -> None:
super().test_roundtrip_coordinates()

@requires_cftime
def test_roundtrip_cftime_bnds(self):
@@ -3691,6 +3691,39 @@
else:
yield {}

def test_chunk_key_encoding_v2(self) -> None:
# Chunk key encoding config that uses "/" as the dimension separator
chunk_key_encoding = {"name": "v2", "configuration": {"separator": "/"}}

# Create a simple 2D dataset
data = np.ones((4, 4))
original = Dataset({"var1": (("x", "y"), data)})

# Set up per-variable encoding with the slash-separated chunk keys
encoding = {
"var1": {
"chunk_key_encoding": chunk_key_encoding,
"chunks": (2, 2),
}
}

# Write to store with custom encoding
with self.create_zarr_target() as store:
original.to_zarr(store, encoding=encoding)

# Verify the chunk keys in store use the slash separator
if not has_zarr_v3:
chunk_keys = [k for k in store.keys() if k.startswith("var1/")]
assert len(chunk_keys) > 0
for key in chunk_keys:
assert "/" in key
assert "." not in key.split("/")[1:] # No dots in chunk coordinates

# Read back and verify data
with xr.open_zarr(store) as actual:
assert_identical(original, actual)
# Verify chunks are preserved
assert actual["var1"].encoding["chunks"] == (2, 2)


@requires_zarr
@pytest.mark.skipif(
@@ -5272,7 +5305,7 @@
def test_dask_roundtrip(self) -> None:
with create_tmp_file() as tmp:
data = create_test_data()
data.to_netcdf(tmp)
chunks = {"dim1": 4, "dim2": 4, "dim3": 4, "time": 10}
with open_dataset(tmp, chunks=chunks) as dask_ds:
assert_identical(data, dask_ds)
@@ -5393,7 +5426,7 @@

def test_cmp_local_file(self) -> None:
with self.create_datasets() as (actual, expected):
assert_equal(actual, expected)

# global attributes should be global attributes on the dataset
assert "NC_GLOBAL" not in actual.attrs
@@ -5436,7 +5469,7 @@
@requires_dask
def test_dask(self) -> None:
with self.create_datasets(chunks={"j": 2}) as (actual, expected):
assert_equal(actual, expected)


@network