-
-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat/batch creation #2665
base: main
Are you sure you want to change the base?
feat/batch creation #2665
Conversation
…/batch-creation
…into feat/batch-creation
…at/batch-creation
…to feat/batch-creation
…at/batch-creation
…into feat/batch-creation
this is now working, so I would appreciate some feedback on the design. The basic design is the same as what I outlined earlier in this PR: there are two new functions that take a approachbasically the same as concurrent group members listing, except we don't need any recursion. I'm scheduling writes and using new functions
Implicit groupsPartial hierarchies like streaming v2 vs v3 node creationcreating v3 arrays / groups requires writing 1 metadata document, but v2 requires 2. To get the most concurrency I await the write of each metadata document separately, which means that Overlap with metadata consolidation logicthere's a lot of similarity between the stuff in this PR and routines used for consolidated metadata. it would be great to find ways to factor out some of the overlap areas still to do:
|
Yes, this would be used in Xarray. |
this PR adds a few functions that have async implementations and sync wrappers, like |
…into feat/batch-creation
@zarr-developers/python-core-devs I'd like to get movement on this PR this week. Does anyone have time to review this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave this a review. My main suggestion is to reduce the scope of new public facing api to just create_hierarchy
. We can come back and expose additional utilities here but for a first cut, I think its useful to limit the surface area of the API.
Really excited to get this into Xarray!
------ | ||
Group | Array | ||
The created nodes in the order they are created. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be worth adding a usage example here.
yield _parse_async_node(result) | ||
|
||
|
||
def create_rooted_hierarchy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the use case for this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function returns a single zarr array or group (the root of the hierarchy); create_hierarchy
returns an iterator over all the created nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the use case is for when someone wants to create an entire hierarchy, and get as a return a value a handle to the root of that hierarchy. I suspect this is actually more typical than users wanting an iterator over everything in the hierarchy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree with @jhamman - create_rooted_hierarchy
doesn't really need to exist if it's pretty easy to get the root from create_hierarchy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting the root is easy, in a computer science sense, but it's also tedious. Look at the source code for create_rooted_hierarchy
and tell me if you want every zarr user to write this themselves (or propose a simplification that renders it untedious :) ):
return _parse_async_node(async_node) | ||
|
||
|
||
def get_node(store: Store, path: str, zarr_format: ZarrFormat) -> Array | Group: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here. this seems like a helper function but one that may not want to include as part of the public api
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can remove it, but I think a function for getting an array or group is pretty useful to end users
overwrite=overwrite, | ||
allow_root=False, | ||
): | ||
yield node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line is missing a test and feels important enough to call out.
allow_root: bool = True, | ||
) -> AsyncIterator[AsyncGroup | AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]]: | ||
""" | ||
Create a complete zarr hierarchy concurrently. Groups that are implicitly defined by the input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see suggested edits above to the api function version of this docstring
yield node | ||
|
||
|
||
async def create_nodes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
advocating to make this a private function and not export it to zarr.api
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the position on using private zarr functions from within xarray?
@@ -1443,7 +1459,454 @@ def test_delitem_removes_children(store: Store, zarr_format: ZarrFormat) -> None | |||
|
|||
|
|||
@pytest.mark.parametrize("store", ["memory"], indirect=True) | |||
def test_group_members_performance(store: MemoryStore) -> None: | |||
@pytest.mark.parametrize("impl", ["async", "sync"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm interested to see if this pattern holds up. Running our sync
function inside pytests IO loop may eventually lead to problems. But let's see.
Co-authored-by: Joe Hamman <[email protected]>
path : str | ||
The name of the root of the created hierarchy. Every key in ``nodes`` will be prefixed with | ||
``path`` prior to creating nodes. | ||
nodes : dict[str, GroupMetadata | ArrayV3Metadata | ArrayV2Metadata] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The usage example (and I guess the type) will probably make this clear, but it'd be good to clarify whether this is the flat or nested representation. IIUC, it's the flat representation so the keys are like ["group/x", "group/y", ...]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the exact syntax of whether or not leading or trailing slashes are expected would be helpful too.
Groups that are implicitly defined by the input will be created as needed. | ||
|
||
This function takes a parsed hierarchy dictionary and creates all the nodes in the hierarchy | ||
concurrently. Arrays and Groups are yielded in the order they are created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the creation order deterministic? If not, then perhaps state that the order isn't guaranteed.
) -> Iterator[Group | Array]: | ||
"""Create a collection of arrays and / or groups concurrently. | ||
|
||
Note: no attempt is made to validate that these arrays and / or groups collectively form a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the main / only difference between create_nodes
and create_hierarchy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, we could just use create_nodes
alone.
IIUC the advantage of create_hierarchy
is "safety" in the public zarr API. But to support writing multiple arbitrary new arrays/groups to an existing store concurrently requires the generality of create_nodes
, so we need that one, and we hence have some "unsafe" public zarr API regardless.
I'm not too worried about using "unsafe" API in xarray because DataTree
should prevent users even creating DataTrees
with invalid zarr group hierarchies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the main / only difference between
create_nodes
andcreate_hierarchy
?
yes. create_hierarchy
attempts to model the rules of the zarr spec, and so it will not take an input like {'a': ArrayMetadata, 'a/b': ArrayMetadata}
, which would nest an array inside another array. Whereas create_nodes
doesn't do any input checking at all. it just creates nodes.
store: Store, | ||
path: str, | ||
nodes: dict[str, GroupMetadata | ArrayV2Metadata | ArrayV3Metadata], | ||
overwrite: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that overwrite
is undocumented here. In other functions it'd documented as
Whether to overwrite existing nodes. Default is ``False``.
Could you update that description to say what happens when an existing node is found with overwrite=False
? Is an error raised, or is the node not updated?
@@ -57,3 +57,10 @@ class NodeTypeValidationError(MetadataValidationError): | |||
This can be raised when the value is invalid or unexpected given the context, | |||
for example an 'array' node when we expected a 'group'. | |||
""" | |||
|
|||
|
|||
class RootedHierarchyError(BaseZarrError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this name a bit confusing, but I might not understand the context.
If we're saying "you've tried to create a root, but it already exists" then perhaps a RootExistsError
? But I think it's actually more like, you've tried to insert a root at a level below the root? So maybe something like RootAsChildError
, NestedRootError
, ChildRootError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about
class NestedRootError(BaseZarrError):
"""
Exception raised when attempting to create a root node relative to a pre-existing root node.
"""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that sounds good.
This PR adds a few routines for creating a collection of arrays and groups (i.e., a dict with path-like keys and
ArrayMetadata
/GroupMetadata
values) in storage concurrently.create_hierarchy
takes a dict representation of a hierarchy, parses that dict to ensure that there are no implicit groups (creating group metadata documents as needed), then invokescreate_nodes
and yields the resultscreate_nodes
concurrently writes metadata documents to storage, and yields the createdAsyncArray
/AsyncGroup
instances.I still need to wire up concurrency limits, and test them.
TODO: