-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Icechunk store #633
base: main
Are you sure you want to change the base?
Icechunk store #633
Conversation
*, | ||
sources: Union["Array", Sequence["Array"]], | ||
targets: List[zarr.Array], | ||
executor=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could definitely address this later but regions
is quite an important kwarg.
if self.store is None: | ||
self.store = store | ||
else: | ||
self.store.merge(store.change_set_bytes()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@paraseba shall we commit to icechunk.distributed.merge_stores
as public API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomwhite It looks like callbacks are "accumulated" on every worker. Is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dcherian My only concern with it is the implementation of merge_stores
is probably very slow, using no parallelism, but I don't see any issues with making it public
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right we can fix that, but note that dask and presumably cubed are accumulating on remote workers already, so there is already some parallelism in how it is used.
Moreover it'd be nice not to have store.change_set_bytes
be the public API :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like callbacks are "accumulated" on every worker. Is that right?
No, they are being accumulated on the client so there is no parallelism. I don't know what Icechunk is doing here, but would it be possible to merge a batch in one go rather than one at a time? Could that be more efficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but would it be possible to merge a batch in one go rather than one at a time
we'll have to build some rust API, we'll get to this eventually.
No, they are being accumulated on the client so there is no parallelism.
ah ok. In that case, how about calling reduction
on each array that was written? That way you parallelize the merge across blocks for each array, and then the only serial bit is the merging across arrays, which will be a lot smaller. I considered this approach for dask, but then just wrote out a tree reduction across all chunks.
EDIT: or is the reduction
approach not viable because you need to serialize to Zarr at some point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or is the
reduction
approach not viable because you need to serialize to Zarr at some point?
Yes, that's basically it - Cubed also separates the data paths for array manipulations (contents of the blocks) from the metadata operations (block IDs and - for Icechunk - the changesets). So I think merging in batches would be more feasible.
we'll have to build some rust API, we'll get to this eventually.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's basically it
:( I was afraid so.
Fixes #628
This is a proof of concept/draft for a
store_icechunk
function that uses Cubed callbacks to do the Icechunk store merging. The tests pass locally, but not in CI yet. I'm not sure where this code should live.