Multi-stage rechunk #650

Closed
tomwhite opened this issue Dec 17, 2024 · 2 comments · Fixed by #700

@tomwhite (Member)
The rechunker algorithm has support for multi-stage rechunking, which reduces the number of IO operations.

It would be good to enable this in Cubed's rechunk. The simplest way would be to expose the min_mem parameter, but it might be interesting to see if there is a way of automatically finding the number of stages that optimizes for something we care about (e.g. minimizing IO ops).
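To make the trade-off concrete, here is a rough, self-contained sketch (not Cubed or rechunker code; `stage_chunks` and `pieces` are hypothetical names) of the multi-stage idea: intermediate chunk shapes are geometrically interpolated between source and target, and the number of copy tasks per stage is counted as the segments of the common refinement of the two chunk grids.

```python
def stage_chunks(source, target, num_stages):
    """Geometrically interpolated chunk shapes for each stage of a
    multi-stage rechunk (a loose sketch of the rechunker idea)."""
    plans = []
    for k in range(1, num_stages):
        frac = k / num_stages
        plans.append(tuple(
            max(1, round(s ** (1 - frac) * t ** frac))
            for s, t in zip(source, target)))
    plans.append(tuple(target))
    return plans

def pieces(shape, chunks_a, chunks_b):
    """Copy tasks for one rechunk stage: segments in the common
    refinement of the two chunk grids, multiplied across dimensions."""
    total = 1
    for length, a, b in zip(shape, chunks_a, chunks_b):
        bounds = set(range(0, length + 1, a)) | set(range(0, length + 1, b))
        bounds.add(length)
        total *= len(bounds) - 1
    return total

shape = (10_000, 10_000)
source = (10_000, 1)   # one tall column per chunk
target = (1, 10_000)   # one wide row per chunk

for n in (1, 2):
    plan = [source] + stage_chunks(source, target, n)
    ops = sum(pieces(shape, a, b) for a, b in zip(plan, plan[1:]))
    print(f"{n} stage(s): {ops:,} copy tasks")
```

In this worst-case transpose, a direct rechunk needs 100,000,000 tiny copies, while routing through an intermediate (100, 100) chunking cuts that to 2,000,000 — illustrating why an extra stage can reduce IO even though it writes an intermediate array.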

@tomwhite (Member, Author) commented Mar 17, 2025
I rechunked one variable (1.5TB) of the ERA5 dataset in 8m 48s running on AWS Lambda using the code in cubed-dev/cubed-benchmarks#29.

The important changes were (see #700):

  • Understanding the extra memory required by S3 Zarr code, measured by running memray: https://github.com/tomwhite/memray-array
  • Changing the rechunking algorithm to use a 7x multiplier to account for all the copies needed
  • Using a runtime image of 3.5GB (increased from 2GB) to provide enough memory (rather than using less at the expense of more rechunk stages)
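As a back-of-the-envelope illustration of the 7x multiplier (a toy model, not the actual Cubed accounting; `required_memory_mb` is a hypothetical helper), the memory budget scales with the largest chunk's byte size times the number of in-flight copies:

```python
def required_memory_mb(chunk_shape, itemsize, copies=7):
    """Estimate runtime memory needed for one rechunk task, assuming
    `copies` buffers of the largest chunk are alive at once (the 7x
    multiplier observed in the issue)."""
    chunk_bytes = itemsize
    for n in chunk_shape:
        chunk_bytes *= n
    return copies * chunk_bytes / 2**20

# A ~100 MB float64 chunk would need roughly 700 MB of headroom
# under this model, which is why a 2GB image can be too tight.
print(round(required_memory_mb((3650, 3600), 8)))
```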

@dcherian
Amazing!
