Optimizing the shuffle #326

Open
TomNicholas opened this issue Nov 17, 2023 · 3 comments
@TomNicholas
Member

TomNicholas commented Nov 17, 2023

Cubed currently always implements the shuffle operation as an all-to-all rechunk, using the algorithm from rechunker. This creates an intermediate persistent Zarr store, and requires every chunk to be written to that store before any chunk can be read back. Can we do better?
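To make the cost concrete, here is a minimal sketch of a write-everything-then-read-everything rechunk. It is not Cubed's or rechunker's actual code: the function names `chunk_slices` and `rechunk_via_store` are hypothetical, and a plain NumPy array stands in for the intermediate persistent Zarr store.

```python
import itertools

import numpy as np


def chunk_slices(shape, chunks):
    """Yield (chunk_key, slices) for every chunk of a regular chunk grid."""
    ranges = [range(0, s, c) for s, c in zip(shape, chunks)]
    for starts in itertools.product(*ranges):
        key = tuple(st // c for st, c in zip(starts, chunks))
        slices = tuple(
            slice(st, min(st + c, s)) for st, c, s in zip(starts, chunks, shape)
        )
        yield key, slices


def rechunk_via_store(source, source_chunks, target_chunks):
    """Rechunk by writing all source chunks, then reading all target chunks.

    Assumption: `store` is an in-memory stand-in for the intermediate
    persistent store; a real implementation would write to Zarr instead.
    """
    shape = source.shape
    store = np.empty(shape, dtype=source.dtype)

    # Stage 1: every source chunk is written to the intermediate store.
    n_writes = 0
    for _, sl in chunk_slices(shape, source_chunks):
        store[sl] = source[sl]
        n_writes += 1

    # Barrier: no read starts until every write above has finished.

    # Stage 2: every target chunk is read back from the intermediate store.
    target = {}
    for key, sl in chunk_slices(shape, target_chunks):
        target[key] = store[sl].copy()
    return target, n_writes, len(target)


source = np.arange(24).reshape(4, 6)
# Column-shaped chunks in, row-shaped chunks out: each output chunk
# needs a piece of every input chunk, which is the all-to-all case.
target, n_writes, n_reads = rechunk_via_store(source, (4, 1), (1, 6))
```

Every byte of the array crosses the store twice, once in each stage, and the barrier between the stages means no output chunk can be produced early.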

We could consider using a different storage service, such as a different Zarr reader, Google Tensorstore (see #187), or maybe Redis. These would all still be fundamentally the same shuffle operation, though.

Another idea is to narrow the number of situations in which we actually need a full rechunk. There are some trivial cases (see #256), but there might be others. @dcherian had an idea for representing rechunk operations as blockwise somehow that I would like to hear more about!
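As one illustration of a case that avoids the full shuffle (an assumption for the sake of example, not necessarily what #256 or @dcherian's idea covers): if every target chunk size is a whole multiple of the source chunk size along each axis, each output chunk is just a fixed group of input chunks concatenated together, so it can be computed blockwise with no intermediate store. The helper names `merge_factors` and `source_blocks_for` below are hypothetical.

```python
import itertools


def merge_factors(source_chunks, target_chunks):
    """Return per-axis merge factors, or None if a full rechunk is needed.

    A factor f on an axis means each target chunk spans exactly f whole
    source chunks along that axis.
    """
    factors = []
    for s, t in zip(source_chunks, target_chunks):
        if t % s != 0:
            return None  # target chunk boundaries cut through source chunks
        factors.append(t // s)
    return tuple(factors)


def source_blocks_for(target_key, factors, source_grid):
    """List the source chunk keys that make up one target chunk.

    `source_grid` is the number of source chunks along each axis; it is
    used to clip the group at the array edge.
    """
    ranges = [
        range(k * f, min((k + 1) * f, n))
        for k, f, n in zip(target_key, factors, source_grid)
    ]
    return list(itertools.product(*ranges))


# Merging pairs of chunks along axis 0 is blockwise-expressible.
factors = merge_factors((2, 3), (4, 3))
blocks = source_blocks_for((1, 0), factors, source_grid=(4, 2))

# Target boundaries that split source chunks are not.
needs_shuffle = merge_factors((2, 3), (3, 3)) is None
```

In the mergeable case each output chunk reads only `prod(factors)` input chunks, rather than potentially touching all of them.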

Pedro Lopez pointed us towards the Primula paper, saying it implements an efficient serverless shuffle (for a big sorting operation). I'm not sure I understand it well enough yet, but my impression is that it's basically the same save-everything-to-intermediate-blob-storage idea that we're already using, plus some minor optimizations.

EDIT: Correct link to Primula paper

@hammer

hammer commented Nov 21, 2023

I think the Primula paper link should be https://dl.acm.org/doi/10.1145/3429357.3430522

@TomNicholas
Member Author

Oh yes! Thanks for spotting that mistake, @hammer.

@TomNicholas
Member Author

See also #502 for another suggestion.
