feat/batch creation #2665

Open
wants to merge 66 commits into
base: main

Conversation

d-v-b
Contributor

@d-v-b d-v-b commented Jan 7, 2025

This PR adds a few routines for creating a collection of arrays and groups (i.e., a dict with path-like keys and ArrayMetadata / GroupMetadata values) in storage concurrently.

  • create_hierarchy takes a dict representation of a hierarchy, parses that dict to ensure that there are no implicit groups (creating group metadata documents as needed), then invokes create_nodes and yields the results
  • create_nodes concurrently writes metadata documents to storage, and yields the created AsyncArray / AsyncGroup instances.

I still need to wire up concurrency limits, and test them.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@d-v-b d-v-b requested review from jhamman and dcherian January 7, 2025 13:27
@normanrz normanrz added this to the After 3.0.0 milestone Jan 7, 2025
@dstansby dstansby added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 9, 2025
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Jan 9, 2025
@d-v-b
Contributor Author

d-v-b commented Jan 10, 2025

this is now working, so I would appreciate some feedback on the design.

The basic design is the same as what I outlined earlier in this PR: there are two new functions that take a dict[path, GroupMetadata | ArrayMetadata] like {'a': GroupMetadata(zarr_format=3), 'a/b': ArrayMetadata(...)} and concurrently persist those metadata documents to storage, resulting in a hierarchy on disk that looks like the dict.

approach

basically the same as concurrent group members listing, except we don't need any recursion. I'm scheduling writes and using as_completed to yield Arrays / Groups when they are available.
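
For readers following along, here is a minimal sketch of that scheduling pattern. It is not the code in this PR: a plain dict stands in for the store, and `write_metadata`, `create_nodes_sketch` and `collect` are hypothetical names. All writes are scheduled up front, and each result is yielded as soon as its write finishes.

```python
import asyncio
from collections.abc import AsyncIterator


async def write_metadata(store: dict[str, bytes], key: str, doc: bytes) -> str:
    """Stand-in for a store write; a real store would do IO here."""
    await asyncio.sleep(0)  # hand control back to the loop, as real IO would
    store[key] = doc
    return key


async def create_nodes_sketch(
    store: dict[str, bytes], docs: dict[str, bytes]
) -> AsyncIterator[str]:
    """Schedule every write at once and yield each key as its write finishes."""
    tasks = [asyncio.create_task(write_metadata(store, k, v)) for k, v in docs.items()]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def collect(store: dict[str, bytes], docs: dict[str, bytes]) -> list[str]:
    """Drain the async generator into a list (completion order, not input order)."""
    return [key async for key in create_nodes_sketch(store, docs)]
```

Because results arrive in completion order, a caller should not rely on the output matching the order of the input dict.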

new functions

  • create_nodes is low-level and doesn't do any checking of its input, so it will happily create invalid hierarchies, e.g. nesting groups inside arrays, or mixing v2 and v3 metadata, and it won't create intermediate groups, either.

  • create_hierarchy is higher level: it parses the input, checks it for invalid hierarchies, and inserts implicit groups as needed.

  • Group.create_hierarchy is a new method on the Group / AsyncGroup classes that takes a hierarchy dict and creates the nodes specified in that dict at locations relative to the path of the group instance. The return value is dict[str, AsyncGroup | AsyncArray], but it also doesn't have to return anything, or it could be an async iterator, so that you can interact with the nodes as they are formed. This is flexible right now, but I think the iterator idea is nice.

  • _from_flat (names welcome) is a new function that creates a group entirely from a hierarchy dict + a store. that dict must specify a root group, otherwise an exception is raised. We could revise this to create a root group if one is not specified. Open to suggestions here.

Implicit groups

Partial hierarchies like {'a': GroupMetadata(), 'a/b/c': ArrayMetadata(...)} implicitly denote a group at a/b. When creating such a hierarchy, if we find an existing group at a/b, then we don't need to create a new one. So in the context of modeling a hierarchy, implicit groups are a little special -- by not specifying the properties of the group, the user / application is tolerant of any group being there. So I introduced a subclass of GroupMetadata called _ImplicitGroupMetadata, which can be inserted into a hierarchy dict to explicitly denote groups that don't need to be written if one already exists. _ImplicitGroupMetadata is just like GroupMetadata except it errors if you try to set any parameter except zarr_format.
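
To make "implicitly denote" concrete, here is a small sketch that finds the implied group paths in a flat hierarchy dict. It is an illustration only, not the parsing logic in this PR: `implicit_group_paths` is a hypothetical helper, the dict values are ignored, and a root group at the empty path is not considered.

```python
def implicit_group_paths(nodes: dict[str, object]) -> set[str]:
    """Return every parent path implied by a key but absent from `nodes`."""
    implied: set[str] = set()
    for key in nodes:
        parts = key.split("/")
        # Every proper prefix of a key names a group that must exist.
        for i in range(1, len(parts)):
            parent = "/".join(parts[:i])
            if parent not in nodes:
                implied.add(parent)
    return implied
```

For the example above, the keys `a` and `a/b/c` imply exactly one missing group, `a/b`.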

streaming v2 vs v3 node creation

creating v3 arrays / groups requires writing 1 metadata document, but v2 requires 2. To get the most concurrency I await the write of each metadata document separately, which means that foo/.zattrs might resolve before foo/.zarray. So in the v2 case I only yield up an array / group when both documents were written.
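
A rough sketch of that bookkeeping, under the assumption that each node maps to a dict of its metadata documents. The names `write_doc`, `create_v2_nodes_sketch` and `collect_v2` are hypothetical and a plain dict stands in for the store. Every document is written as its own task, and a node is yielded only when its count of outstanding documents reaches zero.

```python
import asyncio
from collections.abc import AsyncIterator


async def write_doc(store: dict[str, bytes], node: str, key: str, doc: bytes) -> str:
    """Stand-in for a store write; returns the node the document belongs to."""
    await asyncio.sleep(0)
    store[f"{node}/{key}"] = doc
    return node


async def create_v2_nodes_sketch(
    store: dict[str, bytes], nodes: dict[str, dict[str, bytes]]
) -> AsyncIterator[str]:
    """Yield a node path only once all of its metadata documents are written."""
    remaining = {node: len(docs) for node, docs in nodes.items()}
    tasks = [
        asyncio.create_task(write_doc(store, node, key, doc))
        for node, docs in nodes.items()
        for key, doc in docs.items()
    ]
    for finished in asyncio.as_completed(tasks):
        node = await finished
        remaining[node] -= 1
        if remaining[node] == 0:  # e.g. both .zarray and .zattrs are on disk
            yield node


async def collect_v2(
    store: dict[str, bytes], nodes: dict[str, dict[str, bytes]]
) -> list[str]:
    """Drain the async generator into a list."""
    return [node async for node in create_v2_nodes_sketch(store, nodes)]
```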

Overlap with metadata consolidation logic

there's a lot of similarity between the stuff in this PR and routines used for consolidated metadata. it would be great to find ways to factor out some of the overlap areas

still to do:

  • write some more tests (checking that implicit groups don't get written if a group already exists)
  • handle overwriting. I think the plan here is, if overwrite is False, then we do a check before any writing to ensure that there are no conflicts between the proposed hierarchy and the stuff that actually exists in storage. this check will involve more IO.

@dcherian
Contributor

> question: should we export a sync version of create_hierarchy from the top-level zarr namespace?

Yes, this would be used in Xarray.

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 28, 2025
@d-v-b
Contributor Author

d-v-b commented Jan 28, 2025

this PR adds a few functions that have async implementations and sync wrappers, like create_hierarchy_a (async) and create_hierarchy. Instead of putting (async, sync) pairs in the same module with slightly different names, I wonder if we should split the group module into an async-only namespace and a sync namespace (that imports stuff from the async namespace)? Then we don't need to mangle function names.
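
A sketch of what the split could look like, squeezed into one file for illustration. Two `SimpleNamespace` objects stand in for the two modules, all names here are hypothetical, and `asyncio.run` is used only for brevity (an assumption, not a statement about how sync wrappers are implemented in zarr-python). The point is that both namespaces can expose the same public name.

```python
import asyncio
from collections.abc import AsyncIterator, Iterator
from types import SimpleNamespace


async def _create_hierarchy_async(nodes: dict[str, str]) -> AsyncIterator[str]:
    """Async implementation: yields each path after a pretend write."""
    for path in nodes:
        await asyncio.sleep(0)  # stand-in for awaiting a store write
        yield path


def _create_hierarchy_sync(nodes: dict[str, str]) -> Iterator[str]:
    """Sync wrapper: drains the async generator, then yields the results."""

    async def _drain() -> list[str]:
        return [path async for path in _create_hierarchy_async(nodes)]

    # asyncio.run is an assumption made for brevity; it cannot be called
    # from inside an already running event loop.
    yield from asyncio.run(_drain())


# Two namespaces stand in for two modules, so both can use the same public name.
async_api = SimpleNamespace(create_hierarchy=_create_hierarchy_async)
sync_api = SimpleNamespace(create_hierarchy=_create_hierarchy_sync)
```

With this layout, callers pick the flavor by import path rather than by a suffix such as `_a`.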

@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Feb 2, 2025
@d-v-b
Contributor Author

d-v-b commented Feb 11, 2025

@zarr-developers/python-core-devs I'd like to get movement on this PR this week. Does anyone have time to review this?

Member

@jhamman jhamman left a comment

Gave this a review. My main suggestion is to reduce the scope of the new public-facing API to just create_hierarchy. We can come back and expose additional utilities here, but for a first cut, I think it's useful to limit the surface area of the API.

Really excited to get this into Xarray!

src/zarr/core/group.py (outdated, resolved)
src/zarr/api/synchronous.py (outdated, resolved)
------
Group | Array
The created nodes in the order they are created.
"""
Member

I think it would be worth adding a usage example here.

yield _parse_async_node(result)


def create_rooted_hierarchy(
Member

I don't understand the use case for this function.

Contributor Author

this function returns a single zarr array or group (the root of the hierarchy); create_hierarchy returns an iterator over all the created nodes.

Contributor Author

the use case is for when someone wants to create an entire hierarchy and get back a handle to the root of that hierarchy. I suspect this is actually more typical than users wanting an iterator over everything in the hierarchy.

Member

I think I agree with @jhamman - create_rooted_hierarchy doesn't really need to exist if it's pretty easy to get the root from create_hierarchy.

Contributor Author

Getting the root is easy, in a computer science sense, but it's also tedious. Look at the source code for create_rooted_hierarchy and tell me if you want every zarr user to write this themselves (or propose a simplification that renders it untedious :) ):

https://github.com/d-v-b/zarr-python/blob/545cacb543a8e2c1e634530a1fdb530d6faa23f7/src/zarr/core/group.py#L3555-L3583
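
To give a sense of the bookkeeping involved, here is a hedged sketch of recovering the root from an iterator of created nodes. It is not the linked source: `Node` is a hypothetical stand-in for an array or group with a `path` attribute, and `find_root` is a made-up helper. The caller has to drain the whole iterator, index nodes by path, and pick the one node whose parent was not created.

```python
from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a created array or group."""

    path: str


def find_root(nodes: Iterable[Node]) -> Node:
    """Drain the iterator and return the unique node whose parent was not created."""
    by_path = {node.path: node for node in nodes}
    roots = [
        node
        for path, node in by_path.items()
        # The empty path is the store root; otherwise look for a missing parent.
        if path == "" or path.rpartition("/")[0] not in by_path
    ]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    return roots[0]
```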

return _parse_async_node(async_node)


def get_node(store: Store, path: str, zarr_format: ZarrFormat) -> Array | Group:
Member

same here. this seems like a helper function, but one that we may not want to include as part of the public API

Contributor Author

I can remove it, but I think a function for getting an array or group is pretty useful to end users

overwrite=overwrite,
allow_root=False,
):
yield node
Member

this line is missing a test and feels important enough to call out.

allow_root: bool = True,
) -> AsyncIterator[AsyncGroup | AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]]:
"""
Create a complete zarr hierarchy concurrently. Groups that are implicitly defined by the input
Member

see suggested edits above to the api function version of this docstring

yield node


async def create_nodes(
Member

advocating to make this a private function and not export it to zarr.api.

Member

What's the position on using private zarr functions from within xarray?

@@ -1443,7 +1459,454 @@ def test_delitem_removes_children(store: Store, zarr_format: ZarrFormat) -> None


@pytest.mark.parametrize("store", ["memory"], indirect=True)
def test_group_members_performance(store: MemoryStore) -> None:
@pytest.mark.parametrize("impl", ["async", "sync"])
Member

I'm interested to see if this pattern holds up. Running our sync function inside pytest's IO loop may eventually lead to problems. But let's see.

path : str
The name of the root of the created hierarchy. Every key in ``nodes`` will be prefixed with
``path`` prior to creating nodes.
nodes : dict[str, GroupMetadata | ArrayV3Metadata | ArrayV2Metadata]
Contributor

The usage example (and I guess the type) will probably make this clear, but it'd be good to clarify whether this is the flat or nested representation. IIUC, it's the flat representation so the keys are like ["group/x", "group/y", ...].

Member

Also the exact syntax of whether or not leading or trailing slashes are expected would be helpful too.
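
For anyone unsure what "flat" versus "nested" means here, a small sketch that converts a nested dict into the flat, slash-joined form. This is only an illustration: `flatten` is a hypothetical helper, the string values stand in for metadata objects, and producing keys without leading or trailing slashes is an assumption, not something this thread has settled.

```python
def flatten(nested: dict, prefix: str = "") -> dict[str, str]:
    """Turn {"group": {"x": ...}} into {"group": "group", "group/x": ...}."""
    flat: dict[str, str] = {}
    for name, value in nested.items():
        # Assumption: keys carry no leading or trailing slash.
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(value, dict):
            flat[path] = "group"  # a nested dict plays the role of a group
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat
```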

Groups that are implicitly defined by the input will be created as needed.

This function takes a parsed hierarchy dictionary and creates all the nodes in the hierarchy
concurrently. Arrays and Groups are yielded in the order they are created.
Contributor

Is the creation order deterministic? If not, then perhaps state that the order isn't guaranteed.

) -> Iterator[Group | Array]:
"""Create a collection of arrays and / or groups concurrently.

Note: no attempt is made to validate that these arrays and / or groups collectively form a
Contributor

Is this the main / only difference between create_nodes and create_hierarchy?

Member

@TomNicholas TomNicholas Feb 13, 2025

If so, we could just use create_nodes alone.

IIUC the advantage of create_hierarchy is "safety" in the public zarr API. But to support writing multiple arbitrary new arrays/groups to an existing store concurrently requires the generality of create_nodes, so we need that one, and we hence have some "unsafe" public zarr API regardless.

I'm not too worried about using "unsafe" API in xarray because DataTree should prevent users even creating DataTrees with invalid zarr group hierarchies.

Contributor Author

> Is this the main / only difference between create_nodes and create_hierarchy?

yes. create_hierarchy attempts to model the rules of the zarr spec, and so it will not take an input like {'a': ArrayMetadata, 'a/b': ArrayMetadata}, which would nest an array inside another array. Whereas create_nodes doesn't do any input checking at all. it just creates nodes.
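
As an illustration of the kind of input check described here, a minimal sketch that rejects nodes nested inside arrays. It is not the validation code in this PR: `check_no_nodes_inside_arrays` is a hypothetical name, and the strings "array" and "group" stand in for metadata objects.

```python
def check_no_nodes_inside_arrays(nodes: dict[str, str]) -> None:
    """Raise ValueError if any node sits below an array in the flat hierarchy."""
    array_paths = {path for path, kind in nodes.items() if kind == "array"}
    for path in nodes:
        parts = path.split("/")
        # Walk every proper ancestor of this path.
        for i in range(1, len(parts)):
            ancestor = "/".join(parts[:i])
            if ancestor in array_paths:
                raise ValueError(
                    f"{path!r} is nested inside the array at {ancestor!r}"
                )
```

A create_nodes-style function would skip this step and write whatever it is given.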

store: Store,
path: str,
nodes: dict[str, GroupMetadata | ArrayV2Metadata | ArrayV3Metadata],
overwrite: bool = False,
Contributor

I think that overwrite is undocumented here. In other functions it's documented as

Whether to overwrite existing nodes. Default is ``False``.

Could you update that description to say what happens when an existing node is found with overwrite=False? Is an error raised, or is the node not updated?

@@ -57,3 +57,10 @@ class NodeTypeValidationError(MetadataValidationError):
This can be raised when the value is invalid or unexpected given the context,
for example an 'array' node when we expected a 'group'.
"""


class RootedHierarchyError(BaseZarrError):
Contributor

I find this name a bit confusing, but I might not understand the context.

If we're saying "you've tried to create a root, but it already exists" then perhaps a RootExistsError? But I think it's actually more like, you've tried to insert a root at a level below the root? So maybe something like RootAsChildError, NestedRootError, ChildRootError?

Contributor Author

how about

class NestedRootError(BaseZarrError):
    """
    Exception raised when attempting to create a root node relative to a pre-existing root node.
    """

Contributor

Yeah, that sounds good.

Labels
enhancement New features or improvements
7 participants