feat/batch creation #2665

Open
wants to merge 66 commits into
base: main
66 commits
8faf994
sketch out batch creation routine
d-v-b Dec 11, 2024
8952911
scratch state of easy batch creation
d-v-b Dec 18, 2024
de3c594
Merge branch 'main' of https://github.com/d-v-b/zarr-python into feat…
d-v-b Jan 1, 2025
c700e39
rename tupleize keys
d-v-b Jan 3, 2025
986d68b
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Jan 3, 2025
97b768f
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Jan 7, 2025
b6bf2dd
Merge branch 'feat/batch-creation' of github.com:d-v-b/zarr-python in…
d-v-b Jan 7, 2025
57ceb64
tests and proper implementation for create_nodes and create_hierarchy
d-v-b Jan 7, 2025
181d3d0
privatize
d-v-b Jan 7, 2025
e8e6107
use Posixpath instead of Path in tests; avoid redundant cast
d-v-b Jan 7, 2025
4f2c954
restore cast
d-v-b Jan 7, 2025
dd4174c
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Jan 7, 2025
cf72834
pureposixpath instead of posixpath
d-v-b Jan 7, 2025
e2cff8c
group-level create_hierarchy
d-v-b Jan 7, 2025
0912ecb
docstring
d-v-b Jan 7, 2025
04f7922
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Jan 8, 2025
089feef
sketch out from_flat for groups
d-v-b Jan 8, 2025
116ab87
better concurrency for v2
d-v-b Jan 9, 2025
246f862
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Jan 9, 2025
e38c1ca
revert change to default concurrency
d-v-b Jan 9, 2025
2fb9083
create root correctly
d-v-b Jan 9, 2025
b099fba
working _from_flat
d-v-b Jan 10, 2025
64b54bf
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Jan 10, 2025
4562e86
working dict serialization for _ImplicitGroupMetadata
d-v-b Jan 10, 2025
cdfd5de
remove implicit group metadata, and add some key name normalization
d-v-b Jan 15, 2025
036fd2a
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Jan 15, 2025
787d6bf
add path normalization routines
d-v-b Jan 22, 2025
d07435b
use _join_paths for safer path concatenation
d-v-b Jan 22, 2025
29ecce7
Merge branch 'feat/batch-creation' of github.com:d-v-b/zarr-python in…
d-v-b Jan 22, 2025
63dd07f
handle overwrite
d-v-b Jan 22, 2025
15c4a7e
rename _from_flat to _create_rooted_hierarchy, add sync version
d-v-b Jan 22, 2025
645a447
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Jan 22, 2025
bd9afd1
add test for _create_rooted_hierarchy when the output should be an ar…
d-v-b Jan 22, 2025
8be3876
increase coverage, one way or another
d-v-b Jan 22, 2025
06e5482
remove replace kwarg for _set_return_key
d-v-b Jan 22, 2025
37186d6
shield lines from coverage
d-v-b Jan 22, 2025
ed4e846
add some tests
d-v-b Jan 22, 2025
02ac91d
lint
d-v-b Jan 22, 2025
f6a08a0
improve coverage with more tests
d-v-b Jan 22, 2025
9d2f642
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Jan 22, 2025
ed0d52a
Merge branch 'main' into feat/batch-creation
d-v-b Jan 25, 2025
661678f
use store + path instead of StorePath for hierarchy api
d-v-b Jan 28, 2025
7a718d5
docstrings
d-v-b Jan 28, 2025
23bfef5
docstrings
d-v-b Jan 28, 2025
619eeb5
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Jan 28, 2025
5282534
release notes
d-v-b Jan 28, 2025
6507e43
refactor sync / async functions, and make tests more compact accordingly
d-v-b Jan 28, 2025
6b56342
keyerror -> filenotfounderror
d-v-b Jan 28, 2025
3be878d
keyerror -> filenotfounderror, fixup
d-v-b Jan 28, 2025
774eeda
Merge branch 'main' into feat/batch-creation
d-v-b Jan 28, 2025
f3c506f
add top-level exports
d-v-b Jan 28, 2025
60379a7
Merge branch 'feat/batch-creation' of github.com:d-v-b/zarr-python in…
d-v-b Jan 28, 2025
32e06fa
mildly refactor node input validation
d-v-b Jan 29, 2025
8bd0b57
simplify path normalization
d-v-b Jan 29, 2025
1bb6578
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Feb 2, 2025
d05a43c
refactor to separate sync and async routines
d-v-b Feb 2, 2025
29bab74
remove semaphore kwarg, and add test for concurrency limit sensitivity
d-v-b Feb 2, 2025
2f02c26
wire up semaphore correctly, thanks to a test
d-v-b Feb 2, 2025
6ab8339
export read_node
d-v-b Feb 2, 2025
9b97c95
docstrings
d-v-b Feb 2, 2025
e546519
docstrings
d-v-b Feb 2, 2025
24eab3a
read_node -> get_node
d-v-b Feb 2, 2025
2b02996
Merge branch 'main' into feat/batch-creation
d-v-b Feb 7, 2025
a1e75b9
Merge branch 'main' into feat/batch-creation
d-v-b Feb 10, 2025
fff280c
Merge branch 'main' into feat/batch-creation
d-v-b Feb 11, 2025
545cacb
Update src/zarr/api/synchronous.py
d-v-b Feb 12, 2025
1 change: 1 addition & 0 deletions changes/2665.feature.rst
@@ -0,0 +1 @@
Adds functions for concurrently creating multiple arrays and groups.
6 changes: 6 additions & 0 deletions src/zarr/__init__.py
@@ -8,6 +8,9 @@
    create,
    create_array,
    create_group,
    create_hierarchy,
    create_nodes,
    create_rooted_hierarchy,
    empty,
    empty_like,
    full,
@@ -50,6 +53,9 @@
    "create",
    "create_array",
    "create_group",
    "create_hierarchy",
    "create_nodes",
    "create_rooted_hierarchy",
    "empty",
    "empty_like",
    "full",
14 changes: 13 additions & 1 deletion src/zarr/api/asynchronous.py
@@ -23,7 +23,15 @@
    _warn_write_empty_chunks_kwarg,
    parse_dtype,
)
from zarr.core.group import AsyncGroup, ConsolidatedMetadata, GroupMetadata
from zarr.core.group import (
    AsyncGroup,
    ConsolidatedMetadata,
    GroupMetadata,
    create_hierarchy,
    create_nodes,
    create_rooted_hierarchy,
    get_node,
)
from zarr.core.metadata import ArrayMetadataDict, ArrayV2Metadata, ArrayV3Metadata
from zarr.core.metadata.v2 import _default_compressor, _default_filters
from zarr.errors import NodeTypeValidationError
@@ -48,10 +56,14 @@
    "copy_store",
    "create",
    "create_array",
    "create_hierarchy",
    "create_nodes",
    "create_rooted_hierarchy",
    "empty",
    "empty_like",
    "full",
    "full_like",
    "get_node",
    "group",
    "load",
    "ones",
151 changes: 148 additions & 3 deletions src/zarr/api/synchronous.py
@@ -7,17 +7,19 @@
import zarr.api.asynchronous as async_api
import zarr.core.array
from zarr._compat import _deprecate_positional_args
from zarr.abc.store import Store
from zarr.core.array import Array, AsyncArray
from zarr.core.group import Group
from zarr.core.sync import sync
from zarr.core.group import Group, GroupMetadata, _parse_async_node
from zarr.core.sync import _collect_aiterator, sync

if TYPE_CHECKING:
    from collections.abc import Iterable
    from collections.abc import Iterable, Iterator

    import numpy as np
    import numpy.typing as npt

    from zarr.abc.codec import Codec
    from zarr.abc.store import Store
    from zarr.api.asynchronous import ArrayLike, PathLike
    from zarr.core.array import (
        CompressorsLike,
@@ -36,6 +38,7 @@
        ShapeLike,
        ZarrFormat,
    )
    from zarr.core.metadata import ArrayV2Metadata, ArrayV3Metadata
    from zarr.storage import StoreLike

__all__ = [
@@ -46,10 +49,14 @@
    "copy_store",
    "create",
    "create_array",
    "create_hierarchy",
    "create_nodes",
    "create_rooted_hierarchy",
    "empty",
    "empty_like",
    "full",
    "full_like",
    "get_node",
    "group",
    "load",
    "ones",
@@ -1132,3 +1139,141 @@ def zeros_like(a: ArrayLike, **kwargs: Any) -> Array:
        The new array.
    """
    return Array(sync(async_api.zeros_like(a, **kwargs)))


def create_hierarchy(
    store: Store,
    path: str,
    nodes: dict[str, GroupMetadata | ArrayV2Metadata | ArrayV3Metadata],
    overwrite: bool = False,
Contributor

I think that overwrite is undocumented here. In other functions it's documented as

Whether to overwrite existing nodes. Default is ``False``.

Could you update that description to say what happens when an existing node is found with overwrite=False? Is an error raised, or is the node not updated?

    allow_root: bool = True,
) -> Iterator[Group | Array]:
    """
    Create a complete zarr hierarchy from a collection of metadata objects.

    Groups that are implicitly defined by the input will be created as needed.

    This function takes a parsed hierarchy dictionary and creates all the nodes in the hierarchy
    concurrently. Arrays and Groups are yielded in the order they are created.
Contributor

Is the creation order deterministic? If not, then perhaps state that the order isn't guaranteed.


    Parameters
    ----------
    store : Store
        The storage backend to use.
    path : str
        The name of the root of the created hierarchy. Every key in ``nodes`` will be prefixed with
        ``path`` prior to creating nodes.
    nodes : dict[str, GroupMetadata | ArrayV3Metadata | ArrayV2Metadata]
Contributor

The usage example (and I guess the type) will probably make this clear, but it'd be good to clarify whether this is the flat or nested representation. IIUC, it's the flat representation so the keys are like ["group/x", "group/y", ...].

Member

Also the exact syntax of whether or not leading or trailing slashes are expected would be helpful too.

        A dictionary defining the hierarchy. The keys are the paths of the nodes
        in the hierarchy, and the values are the metadata of the nodes. The
        metadata must be either an instance of GroupMetadata, ArrayV3Metadata
        or ArrayV2Metadata.
    allow_root : bool
        Whether to allow a root node to be created. If ``False``, attempting to create a root node
        will result in an error. Use this option when calling this function as part of a method
        defined on ``AsyncGroup`` instances, because in this case the root node has already been
        created.

    Yields
    ------
    Group | Array
        The created nodes in the order they are created.
    """
Member

I think it would be worth adding a usage example here.

    coro = async_api.create_hierarchy(
        store=store, path=path, nodes=nodes, overwrite=overwrite, allow_root=allow_root
    )

    for result in sync(_collect_aiterator(coro)):
        yield _parse_async_node(result)
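
The docstring above says that groups "implicitly defined by the input" are created as needed, and the review asks whether `nodes` is the flat representation. As a zarr-free illustration of that idea, the sketch below computes which ancestor paths a flat set of keys implies. The helper name `implicit_group_paths`, the spelling of the root as `""`, and the assumption that keys carry no leading or trailing slash are all illustrative choices, not behaviour confirmed by this PR.

```python
from pathlib import PurePosixPath


def implicit_group_paths(keys: set[str]) -> set[str]:
    """Return ancestor paths implied by `keys` but not listed in it.

    Hypothetical helper for illustration only; not part of zarr-python.
    Keys are assumed to be relative and '/'-separated, and the root is
    assumed to be spelled "".
    """
    implied: set[str] = set()
    for key in keys:
        # PurePosixPath("a/b/c").parents yields "a/b", "a" and "." (the root).
        for parent in PurePosixPath(key).parents:
            name = "" if str(parent) == "." else str(parent)
            if name not in keys:
                implied.add(name)
    return implied


flat_keys = {"a/b/array", "a/c"}
missing = implicit_group_paths(flat_keys)
print(sorted(missing))  # ['', 'a', 'a/b']
```

Under these assumptions, a caller who supplies only `a/b/array` and `a/c` would expect three additional groups to exist afterwards: the root, `a`, and `a/b`.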


def create_nodes(
    *, store: Store, path: str, nodes: dict[str, GroupMetadata | ArrayV2Metadata | ArrayV3Metadata]
) -> Iterator[Group | Array]:
    """Create a collection of arrays and / or groups concurrently.

    Note: no attempt is made to validate that these arrays and / or groups collectively form a
Contributor

Is this the main / only difference between create_nodes and create_hierarchy?

Member

@TomNicholas, Feb 13, 2025

If so, we could just use create_nodes alone.

IIUC the advantage of create_hierarchy is "safety" in the public zarr API. But to support writing multiple arbitrary new arrays/groups to an existing store concurrently requires the generality of create_nodes, so we need that one, and we hence have some "unsafe" public zarr API regardless.

I'm not too worried about using "unsafe" API in xarray because DataTree should prevent users even creating DataTrees with invalid zarr group hierarchies.

Contributor Author

> Is this the main / only difference between create_nodes and create_hierarchy?

Yes. create_hierarchy attempts to model the rules of the zarr spec, so it will not take an input like {'a': ArrayMetadata, 'a/b': ArrayMetadata}, which would nest an array inside another array. create_nodes, by contrast, doesn't do any input checking at all; it just creates nodes.

    valid Zarr hierarchy. It is the responsibility of the caller of this function to ensure that
    the ``nodes`` parameter satisfies any correctness constraints.

    Parameters
    ----------
    store : Store
        The storage backend to use.
    path : str
        The name of the root of the created hierarchy. Every key in ``nodes`` will be prefixed with
        ``path`` prior to creating nodes.
    nodes : dict[str, GroupMetadata | ArrayV3Metadata | ArrayV2Metadata]
        A dictionary defining the hierarchy. The keys are the paths of the nodes
        in the hierarchy, and the values are the metadata of the nodes. The
        metadata must be either an instance of GroupMetadata, ArrayV3Metadata
        or ArrayV2Metadata.

    Yields
    ------
    Group | Array
        The created nodes.
    """
    coro = async_api.create_nodes(store=store, path=path, nodes=nodes)

    for result in sync(_collect_aiterator(coro)):
        yield _parse_async_node(result)
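
The discussion above says create_hierarchy rejects inputs that nest an array inside another array, while create_nodes performs no checking. A minimal sketch of such a check follows. It is not the validation code from this PR: `check_no_nested_arrays` is a hypothetical name, and plain strings (`"array"` / `"group"`) stand in for the real metadata objects.

```python
def check_no_nested_arrays(kinds: dict[str, str]) -> None:
    """Raise ValueError if any node is placed beneath an array.

    `kinds` maps flat, '/'-separated keys to "array" or "group"; it is a
    hypothetical stand-in for a dict of real metadata objects. The root is
    assumed to be spelled "".
    """
    array_keys = {key for key, kind in kinds.items() if kind == "array"}
    for key in kinds:
        if key == "":
            continue  # the root has no ancestors
        parts = key.split("/")
        # The root plus every proper prefix of the key is an ancestor.
        ancestors = [""] + ["/".join(parts[:depth]) for depth in range(1, len(parts))]
        for ancestor in ancestors:
            if ancestor in array_keys:
                raise ValueError(f"{key!r} is nested inside the array {ancestor!r}")


# A group that contains an array passes the check.
check_no_nested_arrays({"a": "group", "a/b": "array"})

# The example input from the discussion is rejected.
try:
    check_no_nested_arrays({"a": "array", "a/b": "array"})
except ValueError as err:
    print(err)
```

A caller of an unchecked routine like create_nodes could run a check of this kind first if the input does not come from a source that already guarantees a valid hierarchy.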


def create_rooted_hierarchy(
Member

I don't understand the use case for this function.

Contributor Author

This function returns a single zarr array or group (the root of the hierarchy); create_hierarchy returns an iterator over all the created nodes.

Contributor Author

The use case is for when someone wants to create an entire hierarchy and get, as a return value, a handle to the root of that hierarchy. I suspect this is actually more typical than users wanting an iterator over everything in the hierarchy.

Member

I think I agree with @jhamman - create_rooted_hierarchy doesn't really need to exist if it's pretty easy to get the root from create_hierarchy.

Contributor Author

Getting the root is easy, in a computer science sense, but it's also tedious. Look at the source code for create_rooted_hierarchy and tell me if you want every zarr user to write this themselves (or propose a simplification that renders it untedious :) ):

https://github.com/d-v-b/zarr-python/blob/545cacb543a8e2c1e634530a1fdb530d6faa23f7/src/zarr/core/group.py#L3555-L3583

    *,
    store: Store,
    path: str,
    nodes: dict[str, GroupMetadata | ArrayV2Metadata | ArrayV3Metadata],
    overwrite: bool = False,
) -> Group | Array:
    """
    Create a Zarr hierarchy with a root, and return the root node, which could be a ``Group``
    or ``Array`` instance.

    Parameters
    ----------
    store : Store
        The storage backend to use.
    path : str
        The name of the root of the created hierarchy. Every key in ``nodes`` will be prefixed with
        ``path`` prior to creating nodes.
    nodes : dict[str, GroupMetadata | ArrayV3Metadata | ArrayV2Metadata]
        A dictionary defining the hierarchy. The keys are the paths of the nodes
        in the hierarchy, and the values are the metadata of the nodes. The
        metadata must be either an instance of GroupMetadata, ArrayV3Metadata
        or ArrayV2Metadata.
    overwrite : bool
        Whether to overwrite existing nodes. Default is ``False``.

    Returns
    -------
    Group | Array
    """
    async_node = sync(
        async_api.create_rooted_hierarchy(store=store, path=path, nodes=nodes, overwrite=overwrite)
    )
    return _parse_async_node(async_node)
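
To make the "easy but tedious" point in the thread concrete, here is one way a caller might locate the root of a flat set of keys by hand. This is only a sketch and is not the code at the linked lines of `group.py`; `find_root_key` is a hypothetical name, and the sketch assumes '/'-separated keys with the root spelled `""`.

```python
def find_root_key(keys: list[str]) -> str:
    """Return the one key that is an ancestor of every other key.

    Hypothetical helper for illustration; raises ValueError when the keys
    are empty or do not share a single root.
    """
    if not keys:
        raise ValueError("no keys given")

    def depth(key: str) -> int:
        # The root "" has depth 0, "a" has depth 1, "a/b" has depth 2, ...
        return 0 if key == "" else key.count("/") + 1

    # Only the shallowest key can be the root.
    candidate = min(keys, key=depth)
    prefix = "" if candidate == "" else candidate + "/"
    for key in keys:
        if key != candidate and not key.startswith(prefix):
            raise ValueError(f"{key!r} is not a descendant of {candidate!r}")
    return candidate


print(find_root_key(["a/b", "a", "a/b/c"]))  # a
```

With a helper like this, a caller would still have to iterate over everything yielded by create_hierarchy and pick out the node whose path matches the root key, which is the bookkeeping that create_rooted_hierarchy is meant to absorb.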


def get_node(store: Store, path: str, zarr_format: ZarrFormat) -> Array | Group:
Member

Same here. This seems like a helper function, but one that we may not want to include as part of the public API.

Contributor Author

I can remove it, but I think a function for getting an array or group is pretty useful to end users

    """
    Get an Array or Group from a path in a Store.

    Parameters
    ----------
    store : Store
        The store-like object to read from.
    path : str
        The path to the node to read.
    zarr_format : {2, 3}
        The zarr format of the node to read.

    Returns
    -------
    Array | Group
    """

    return _parse_async_node(
        sync(async_api.get_node(store=store, path=path, zarr_format=zarr_format))
    )
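
The synchronous wrappers in this diff share one shape: run an async routine to completion, collect its results, then hand them back. The sketch below reproduces that shape with only the standard library, partly to illustrate the reviewer's question about creation order. Every name in it is hypothetical: `asyncio.run` stands in for zarr's `sync`, `collect` for `_collect_aiterator`, and `fake_create_nodes` for the real async creation routine.

```python
import asyncio
from collections.abc import AsyncIterator, Iterator


async def fake_create_nodes(names: list[str]) -> AsyncIterator[str]:
    """Stand-in for an async creation routine; yields names as tasks finish."""

    async def create(name: str) -> str:
        await asyncio.sleep(0)  # pretend to write metadata to a store
        return name

    tasks = [asyncio.create_task(create(name)) for name in names]
    # as_completed yields results in completion order, not submission order.
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def collect(source: AsyncIterator[str]) -> tuple[str, ...]:
    """Drain an async iterator into a tuple."""
    return tuple([item async for item in source])


def create_nodes_sync(names: list[str]) -> Iterator[str]:
    """Blocking wrapper: run everything, then yield the collected results."""
    for item in asyncio.run(collect(fake_create_nodes(names))):
        yield item


created = list(create_nodes_sync(["a", "b", "c"]))
print(sorted(created))
```

Two properties of this sketch are worth noting. The order of `created` depends on task completion, so a caller should not rely on it. And because the wrapper is a generator function, nothing runs until the caller starts iterating; the wrappers in this diff also use `yield`, so the same would be expected of them.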