Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Icechunk store #633

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Icechunk store #633

wants to merge 2 commits into from

Conversation

tomwhite
Copy link
Member

@tomwhite tomwhite commented Dec 4, 2024

Fixes #628

This is a proof of concept/draft for a store_icechunk function that uses Cubed callbacks to do the Icechunk store merging. The tests pass locally, but not in CI yet. I'm not sure where this code should live.

@tomwhite tomwhite mentioned this pull request Dec 4, 2024
*,
sources: Union["Array", Sequence["Array"]],
targets: List[zarr.Array],
executor=None,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could definitely address this later but regions is quite an important kwarg.

if self.store is None:
self.store = store
else:
self.store.merge(store.change_set_bytes())
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomwhite It looks like callbacks are "accumulated" on every worker. Is that right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcherian My only concern with it is the implementation of merge_stores is probably very slow, using no parallelism, but I don't see any issues with making it public

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right we can fix that, but note that dask and presumably cubed are accumulating on remote workers already, so there is already some parallelism in how it is used.

Moreover it'd be nice not to have store.change_set_bytes be the public API :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like callbacks are "accumulated" on every worker. Is that right?

No, they are being accumulated on the client so there is no parallelism. I don't know what Icechunk is doing here, but would it be possible to merge a batch in one go rather than one at a time? Could that be more efficient?

Copy link

@dcherian dcherian Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but would it be possible to merge a batch in one go rather than one at a time

we'll have to build some rust API, we'll get to this eventually.

No, they are being accumulated on the client so there is no parallelism.

ah ok. In that case, how about calling reduction on each array that was written? That way you parallelize the merge across blocks for each array, and then the only serial bit is the merging across arrays, which will be a lot smaller. I considered this approach for dask, but then just wrote out a tree reduction across all chunks.

EDIT: or is the reduction approach not viable because you need to serialize to Zarr at some point?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or is the reduction approach not viable because you need to serialize to Zarr at some point?

Yes, that's basically it - Cubed also separates the data paths for array manipulations (contents of the blocks) from the metadata operations (block IDs and - for Icechunk - the changesets). So I think merging in batches would be more feasible.

we'll have to build some rust API, we'll get to this eventually.

+1

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's basically it

:( I was afraid so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Icechunk integration
3 participants