
(feat): support for zarr-python>=3 #1726

Status: Draft. Wants to merge 71 commits into base: main.

Commits (71, showing changes from all commits)
0840150
(wip): support for new zarr version
ilan-gold Oct 21, 2024
7876318
Merge branch 'main' into ig/zarr_v3
ilan-gold Nov 18, 2024
892888e
(chore): create setting for write version
ilan-gold Nov 18, 2024
4846e7d
(fix): pathing issue
ilan-gold Nov 18, 2024
4eeb1ec
(chore): use `open_group`
ilan-gold Nov 18, 2024
19b7f41
(fix): another `zarr.open`
ilan-gold Nov 18, 2024
aafcc7a
(fix): `zarr_write_version` -> `zarr_write_format`
ilan-gold Nov 18, 2024
215c761
(feat): batched reading for sparse
ilan-gold Dec 5, 2024
7cf74b2
(fix): object codec
ilan-gold Dec 5, 2024
fb4fabc
Merge branch 'main' into ig/zarr_v3
ilan-gold Dec 5, 2024
4594c52
(fix): revert compressed vectors
ilan-gold Dec 5, 2024
f465a95
(feat): scalar support
ilan-gold Dec 5, 2024
65e736a
(fix): `open_group` for v3
ilan-gold Dec 5, 2024
ae8bd4a
(fix): backed sparse copy method
ilan-gold Dec 5, 2024
5c2a8b3
(fix): add speed-up for zarr by batching indexing
ilan-gold Dec 8, 2024
434d15c
(fix): no `__len__` on new zarr arrays
ilan-gold Dec 8, 2024
688ff9d
(fix): some v3 fixes
ilan-gold Jan 9, 2025
14a226d
Merge branch 'main' into ig/zarr_v3
flying-sheep Jan 10, 2025
4defc32
Specify mode with kwarg
flying-sheep Jan 10, 2025
1687554
chore: typing fixes
flying-sheep Jan 10, 2025
a160642
Merge branch 'ig/zarr_v3' of github.com:scverse/anndata into ig/zarr_v3
ilan-gold Jan 13, 2025
ff52eb3
(fix): more `create_dataset` args
ilan-gold Jan 13, 2025
340539f
Merge branch 'main' into ig/zarr_v3
flying-sheep Jan 14, 2025
1316c12
format->version
flying-sheep Jan 14, 2025
2b19ba1
clear properly
flying-sheep Jan 14, 2025
a12862a
dynamic format/version
flying-sheep Jan 14, 2025
f85b027
Centralize version comparison
flying-sheep Jan 14, 2025
d31d2e5
Fix most create_dataset errors
flying-sheep Jan 14, 2025
f9830ae
unpin zarr everywhere
flying-sheep Jan 14, 2025
f1ca6f7
Almost fix docs
flying-sheep Jan 14, 2025
3dc87a3
(fix): compression test
ilan-gold Jan 14, 2025
3856c9f
(fix): tracking store
ilan-gold Jan 14, 2025
d953f93
(fix): `as_group` `mode` arg
ilan-gold Jan 15, 2025
91ba051
(fix): temporary fix for writing from zarr array
ilan-gold Jan 15, 2025
db7c025
(fix): more context issues
ilan-gold Jan 15, 2025
39a31a3
(fix): no `chunks` `bool` arg
ilan-gold Jan 15, 2025
eec2e63
(fix): return `item` for array
ilan-gold Jan 15, 2025
ed6d2a4
(chore): add issue
ilan-gold Jan 15, 2025
41af7ea
Bump scanpydoc version
flying-sheep Jan 16, 2025
418f063
Merge branch 'main' into ig/zarr_v3
flying-sheep Jan 16, 2025
eb5dc85
Merge branch 'main' into ig/zarr_v3
ilan-gold Jan 20, 2025
3ac48cb
(chore): add zarr v2 test
ilan-gold Jan 20, 2025
ac5b86d
(chore): add warning for zarr v2
ilan-gold Jan 20, 2025
5fbbaef
(fix): fix tracking tests for zarr v2
ilan-gold Jan 20, 2025
1d85eea
(chore): refactor a bit
ilan-gold Jan 20, 2025
80fef78
(fix): revert zarr v2 test
ilan-gold Jan 21, 2025
c394159
(fix): setting on group test
ilan-gold Jan 21, 2025
bc3f2bc
(fix): reopening issue
ilan-gold Jan 24, 2025
ae18de3
(fix): using zip files for backwards compat
ilan-gold Jan 24, 2025
a7d3bf7
Merge branch 'main' into ig/zarr_v3
ilan-gold Jan 24, 2025
08607e8
(fix): temporary fix for `empty`
ilan-gold Jan 24, 2025
9a6b932
(fix): io dispatched keys test
ilan-gold Jan 24, 2025
ae32f2b
(fix): sparse array access tracking
ilan-gold Jan 24, 2025
8cf58e1
(fix): more reopen zarr store
ilan-gold Jan 24, 2025
fedccca
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 24, 2025
ecfdcb8
(fix): lazy reading
ilan-gold Jan 27, 2025
4cc63eb
Merge branch 'ig/zarr_v3' of github.com:scverse/anndata into ig/zarr_v3
ilan-gold Jan 27, 2025
c10a99c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 27, 2025
5c9cb70
Merge branch 'main' into ig/zarr_v3
ilan-gold Feb 2, 2025
33fb472
(fix): ensure v2 group is made
ilan-gold Feb 3, 2025
b89d524
(fix): add `visititems`
ilan-gold Feb 3, 2025
ecc4662
(fix): point zarr at main
ilan-gold Feb 3, 2025
f1b45de
Merge branch 'ig/zarr_v3' of github.com:scverse/anndata into ig/zarr_v3
ilan-gold Feb 3, 2025
099a248
(fix): don't provide `VLenUTF8Codec` for default string
ilan-gold Feb 3, 2025
1762697
(fix): doctest
ilan-gold Feb 3, 2025
6aced7e
(fix): move off the zarr main branch
ilan-gold Feb 4, 2025
7c5d2cd
(fix): pin zarr in benchmarks
ilan-gold Feb 4, 2025
b09666a
(feat): zarr v2 handling in tests
ilan-gold Feb 4, 2025
015cb6f
(fix): warning on `zarr.open`
ilan-gold Feb 4, 2025
62b3654
(fix): `create_array` instead of `create_dataset`
ilan-gold Feb 4, 2025
122e311
Merge branch 'main' into ig/zarr_v3
ilan-gold Feb 4, 2025
1 change: 1 addition & 0 deletions .azure-pipelines.yml
@@ -28,6 +28,7 @@ jobs:
python.version: "3.10"
DEPENDENCIES_VERSION: "minimum"
TEST_TYPE: "coverage"

steps:
- task: UsePythonVersion@0
inputs:
4 changes: 3 additions & 1 deletion docs/conf.py
@@ -111,8 +111,8 @@ def setup(app: Sphinx):
python=("https://docs.python.org/3", None),
scipy=("https://docs.scipy.org/doc/scipy", None),
sklearn=("https://scikit-learn.org/stable", None),
zarr=("https://zarr.readthedocs.io/en/stable/", None),
xarray=("https://docs.xarray.dev/en/stable", None),
zarr=("https://zarr.readthedocs.io/en/v2.18.4/", None),
)
qualname_overrides = {
"h5py._hl.group.Group": "h5py.Group",
@@ -128,6 +128,8 @@ def setup(app: Sphinx):
"anndata._types.WriteCallback": "anndata.experimental.WriteCallback",
"anndata._types.Read": "anndata.experimental.Read",
"anndata._types.Write": "anndata.experimental.Write",
"zarr.core.array.Array": "zarr.Array",
"zarr.core.group.Group": "zarr.Group",
"anndata.compat.DaskArray": "dask.array.Array",
"anndata.compat.CupyArray": "cupy.ndarray",
"anndata.compat.CupySparseMatrix": "cupyx.scipy.sparse.spmatrix",
2 changes: 1 addition & 1 deletion docs/fileformat-prose.md
@@ -91,7 +91,7 @@ Using this information, we're able to dispatch onto readers for the different el
## Dense arrays

Dense numeric arrays have the most simple representation on disk,
as they have native equivalents in H5py {doc}`h5py:high/dataset` and Zarr {ref}`Arrays <zarr:tutorial_create>`.
as they have native equivalents in H5py {doc}`h5py:high/dataset` and Zarr {doc}`Arrays <zarr:user-guide/arrays>`.
We can see an example of this with dimensionality reductions stored in the `obsm` group:

`````{tab-set}
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -49,6 +49,7 @@ dependencies = [
# array-api-compat 1.5 has https://github.com/scverse/anndata/issues/1410
"array_api_compat>1.4,!=1.5",
"legacy-api-wrap",
"zarr",
]
dynamic = ["version"]

@@ -74,7 +75,6 @@ doc = [
"sphinxext.opengraph",
"nbsphinx",
"scanpydoc[theme,typehints] >=0.15.1",
"zarr<3",
"awkward>=2.3",
"IPython", # For syntax highlighting in notebooks
"myst_parser",
@@ -88,7 +88,6 @@ test = [
"loompy>=3.0.5",
"pytest>=8.2,<8.3.4",
"pytest-cov>=2.10",
"zarr<3",
"matplotlib",
"scikit-learn",
"openpyxl",
@@ -149,6 +148,7 @@ filterwarnings_when_strict = [
"default:(Observation|Variable) names are not unique. To make them unique:UserWarning",
"default::scipy.sparse.SparseEfficiencyWarning",
"default::dask.array.core.PerformanceWarning",
"default:anndata will no longer support zarr v2:FutureWarning"
]
python_files = "test_*.py"
testpaths = [
9 changes: 8 additions & 1 deletion src/anndata/_core/anndata.py
@@ -1944,7 +1944,7 @@
def write_zarr(
self,
store: MutableMapping | PathLike,
chunks: bool | int | tuple[int, ...] | None = None,
chunks: tuple[int, ...] | None = None,
):
"""\
Write a hierarchical Zarr array store.
@@ -1958,6 +1958,13 @@
"""
from ..io import write_zarr

# TODO: What is a bool for chunks supposed to do?
if isinstance(chunks, bool):
msg = (
"Passing `write_zarr(adata, chunks=True)` is no longer supported. "
"Please pass `write_zarr(adata)` instead."
)
raise ValueError(msg)

write_zarr(store, self, chunks=chunks)

def chunked_X(self, chunk_size: int | None = None):
Expand Down
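Several hunks in this PR gate behavior on a compat helper, `is_zarr_v2()`, imported from `anndata.compat`. The diff does not show that helper's body; a minimal stdlib-only sketch of such a major-version check might look like the following (taking the version string as a parameter, rather than reading `zarr.__version__` directly, is an illustrative simplification so the sketch needs no zarr install):

```python
def is_zarr_v2(zarr_version: str) -> bool:
    """Return True when the given zarr-python version is from the 2.x series.

    A real compat helper would inspect ``zarr.__version__`` itself; the
    explicit parameter here is an assumption made for testability.
    """
    major = zarr_version.partition(".")[0]
    # Strip any pre-release suffix such as "3b1" before converting.
    digits = "".join(ch for ch in major if ch.isdigit())
    return int(digits) < 3
```

For example, `is_zarr_v2("2.18.4")` is true while `is_zarr_v2("3.0.2")` is false, which is the split the hunks below branch on.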
31 changes: 23 additions & 8 deletions src/anndata/_core/sparse_dataset.py
@@ -30,7 +30,7 @@

from .. import abc
from .._settings import settings
from ..compat import H5Group, SpArray, ZarrArray, ZarrGroup, _read_attr
from ..compat import H5Group, SpArray, ZarrArray, ZarrGroup, _read_attr, is_zarr_v2
from .index import _fix_slice_bounds, _subset, unpack_index

if TYPE_CHECKING:
@@ -73,13 +73,22 @@ def copy(self) -> ss.csr_matrix | ss.csc_matrix:
if isinstance(self.data, ZarrArray):
import zarr

return sparse_dataset(
zarr.open(
if is_zarr_v2():
sparse_group = zarr.open(
store=self.data.store,
mode="r",
chunk_store=self.data.chunk_store, # chunk_store is needed, not clear why
)[Path(self.data.path).parent]
).to_memory()
else:
anndata_group = zarr.open_group(store=self.data.store, mode="r")
sparse_group = anndata_group[
str(
Path(str(self.data.store_path))
.relative_to(str(anndata_group.store_path))
.parent
)
]
return sparse_dataset(sparse_group).to_memory()
return super().copy()

def _set_many(self, i: Iterable[int], j: Iterable[int], x):
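The v3 branch of `copy()` above recovers the sparse group's key by taking the backing array's store path relative to the reopened root and stepping up one level (the `data` array lives inside the sparse group). That path arithmetic can be sketched with the stdlib alone (function name hypothetical):

```python
from pathlib import PurePosixPath

def sparse_group_key(array_path: str, root_path: str) -> str:
    """Key of the group holding a sparse matrix's components, given the
    path of its ``data`` array and the path of the opened root group."""
    return str(PurePosixPath(array_path).relative_to(root_path).parent)
```

For instance, with a `data` array at `my.zarr/layers/counts/data` under a root opened at `my.zarr`, this yields the key `layers/counts`, which can then be indexed out of the reopened group as in the hunk above.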
@@ -534,9 +543,9 @@ def append(self, sparse_matrix: ss.csr_matrix | ss.csc_matrix | SpArray) -> None
f"{self.format!r} and {sparse_matrix.format!r}"
)
raise ValueError(msg)
indptr_offset = len(self.group["indices"])
[indptr_offset] = self.group["indices"].shape
if self.group["indptr"].dtype == np.int32:
new_nnz = indptr_offset + len(sparse_matrix.indices)
new_nnz = indptr_offset + sparse_matrix.indices.shape[0]
if new_nnz >= np.iinfo(np.int32).max:
msg = (
"This array was written with a 32 bit intptr, but is now large "
@@ -567,7 +576,13 @@ def append(self, sparse_matrix: ss.csr_matrix | ss.csc_matrix | SpArray) -> None
data = self.group["data"]
orig_data_size = data.shape[0]
data.resize((orig_data_size + sparse_matrix.data.shape[0],))
data[orig_data_size:] = sparse_matrix.data
# see https://github.com/zarr-developers/zarr-python/discussions/2712 for why we need to read first
append_data = sparse_matrix.data
append_indices = sparse_matrix.indices
if isinstance(sparse_matrix.data, ZarrArray) and not is_zarr_v2():
append_data = append_data[...]
append_indices = append_indices[...]
data[orig_data_size:] = append_data

# indptr
indptr = self.group["indptr"]
@@ -581,7 +596,7 @@ def append(self, sparse_matrix: ss.csr_matrix | ss.csc_matrix | SpArray) -> None
indices = self.group["indices"]
orig_data_size = indices.shape[0]
indices.resize((orig_data_size + sparse_matrix.indices.shape[0],))
indices[orig_data_size:] = sparse_matrix.indices
indices[orig_data_size:] = append_indices

# Clear cached property
for attr in ["_indptr", "_indices", "_data"]:
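The `append` hunks above shift the incoming matrix's `indptr` by the number of nonzeros already on disk; the change to `[indptr_offset] = self.group["indices"].shape` reads that count from the array's shape because new zarr arrays no longer support `len()`. The underlying CSR bookkeeping can be sketched in pure Python (lists stand in for the on-disk arrays; names are hypothetical):

```python
def append_csr(a, b):
    """Row-wise concatenation of two CSR matrices given as
    (data, indices, indptr) triples, mirroring the on-disk append."""
    data_a, indices_a, indptr_a = a
    data_b, indices_b, indptr_b = b
    offset = len(data_a)  # nnz already stored; on disk this is indices.shape[0]
    # b's indptr entries are shifted by the existing nnz, and its
    # leading zero is dropped so the pointer array stays monotonic.
    indptr = indptr_a + [p + offset for p in indptr_b[1:]]
    return data_a + data_b, indices_a + indices_b, indptr
```

Appending a one-row matrix `([3], [0], [0, 1])` to a two-row matrix `([1, 2], [0, 1], [0, 1, 2])` produces `indptr == [0, 1, 2, 3]`, exactly the shift the hunk performs before resizing the on-disk arrays.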
13 changes: 11 additions & 2 deletions src/anndata/_io/h5ad.py
@@ -21,6 +21,7 @@
_clean_uns,
_decode_structured_array,
_from_fixed_length_strings,
is_zarr_v2,
)
from ..experimental import read_dispatched
from .specs import read_elem, write_elem
@@ -38,6 +39,7 @@
from typing import Any, Literal

from .._core.file_backing import AnnDataFileManager
from .._types import GroupStorageType

T = TypeVar("T")

@@ -113,7 +115,7 @@
@report_write_key_on_error
@write_spec(IOSpec("array", "0.2.0"))
def write_sparse_as_dense(
f: h5py.Group,
f: GroupStorageType,
key: str,
value: sparse.spmatrix | BaseCompressedSparseDataset,
*,
@@ -129,7 +131,14 @@
key = re.sub(r"(.*)(\w(?!.*/))", r"\1_\2", key.rstrip("/"))
else:
del f[key] # Wipe before write
dset = f.create_dataset(key, shape=value.shape, dtype=value.dtype, **dataset_kwargs)
if isinstance(f, h5py.Group) or is_zarr_v2():
dset = f.create_dataset(
key, shape=value.shape, dtype=value.dtype, **dataset_kwargs
)
else:
dset = f.create_array(
key, shape=value.shape, dtype=value.dtype, **dataset_kwargs
)
compressed_axis = int(isinstance(value, sparse.csc_matrix))
for idx in idx_chunks_along_axis(value.shape, compressed_axis, 1000):
dset[idx] = value[idx].toarray()
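The hunk above branches on `isinstance(f, h5py.Group) or is_zarr_v2()` because zarr-python 3 introduces `Group.create_array`, while h5py and zarr 2 spell the same operation `create_dataset`. An alternative to the version check is duck typing on the group object; the sketch below uses that variant (not the PR's actual approach) so it can be exercised with plain stub objects:

```python
def create_dense(group, key, *, shape, dtype, **dataset_kwargs):
    """Create a dense array in ``group``, preferring the zarr v3 spelling
    ``create_array`` and falling back to ``create_dataset`` (h5py, zarr 2)."""
    factory = getattr(group, "create_array", None) or group.create_dataset
    return factory(key, shape=shape, dtype=dtype, **dataset_kwargs)
```

A duck-typed dispatch like this avoids importing zarr at call time, at the cost of silently using `create_array` on any object that happens to define it; the explicit `isinstance`/version check in the hunk above makes the intent more legible.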