Multi-stage rechunk #650

Closed
tomwhite opened this issue Dec 17, 2024 · 2 comments · Fixed by #700

@tomwhite (Member)
The rechunker algorithm has support for multi-stage rechunking, which reduces the number of IO operations.

It would be good to enable this in Cubed's rechunk. The simplest way would be to expose the min_mem parameter, but it might be interesting to see if there is a way of automatically finding the number of stages that optimizes for something we care about (e.g. minimizing IO ops).
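To make the trade-off concrete, here is a rough, self-contained sketch (not Cubed or rechunker code; `stage_chunks` and `pieces` are hypothetical names) of the multi-stage idea: intermediate chunk shapes are geometrically interpolated between source and target, and the number of copy tasks per stage is counted as the segments of the common refinement of the two chunk grids.

```python
def stage_chunks(source, target, num_stages):
    """Geometrically interpolated chunk shapes for each stage of a
    multi-stage rechunk (a loose sketch of the rechunker idea)."""
    plans = []
    for k in range(1, num_stages):
        frac = k / num_stages
        plans.append(tuple(
            max(1, round(s ** (1 - frac) * t ** frac))
            for s, t in zip(source, target)))
    plans.append(tuple(target))
    return plans

def pieces(shape, chunks_a, chunks_b):
    """Copy tasks for one rechunk stage: segments in the common
    refinement of the two chunk grids, multiplied across dimensions."""
    total = 1
    for length, a, b in zip(shape, chunks_a, chunks_b):
        bounds = set(range(0, length + 1, a)) | set(range(0, length + 1, b))
        bounds.add(length)
        total *= len(bounds) - 1
    return total

shape = (10_000, 10_000)
source = (10_000, 1)   # one tall column per chunk
target = (1, 10_000)   # one wide row per chunk

for n in (1, 2):
    plan = [source] + stage_chunks(source, target, n)
    ops = sum(pieces(shape, a, b) for a, b in zip(plan, plan[1:]))
    print(f"{n} stage(s): {ops:,} copy tasks")
```

In this worst-case transpose, a direct rechunk needs 100,000,000 tiny copies, while routing through an intermediate (100, 100) chunking cuts that to 2,000,000 — illustrating why an extra stage can reduce IO even though it writes an intermediate array.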

@tomwhite (Member, Author) commented Mar 17, 2025
I rechunked one variable (1.5TB) of the ERA5 dataset in 8m 48s running on AWS Lambda using the code in cubed-dev/cubed-benchmarks#29.

The important changes were (see #700):

  • Understanding the extra memory required by S3 Zarr code, measured by running memray: https://github.com/tomwhite/memray-array
  • Changing the rechunking algorithm to use a 7x multiplier to account for all the copies needed
  • Using a runtime image of 3.5GB (increased from 2GB) to provide enough memory (rather than using less at the expense of more rechunk stages)
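As a back-of-the-envelope illustration of the 7x multiplier (a toy model, not the actual Cubed accounting; `required_memory_mb` is a hypothetical helper), the memory budget scales with the largest chunk's byte size times the number of in-flight copies:

```python
def required_memory_mb(chunk_shape, itemsize, copies=7):
    """Estimate runtime memory needed for one rechunk task, assuming
    `copies` buffers of the largest chunk are alive at once (the 7x
    multiplier observed in the issue)."""
    chunk_bytes = itemsize
    for n in chunk_shape:
        chunk_bytes *= n
    return copies * chunk_bytes / 2**20

# A ~100 MB float64 chunk would need roughly 700 MB of headroom
# under this model, which is why a 2GB image can be too tight.
print(round(required_memory_mb((3650, 3600), 8)))
```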

@dcherian
Amazing!
