
Jax integration #304

Open · alxmrs opened this issue Sep 10, 2023 · 10 comments

Comments

@alxmrs (Contributor) commented Sep 10, 2023

Can the core array API ops of Cubed be implemented in JAX, such that everything easily compiles to accelerators? Could this solve the common pain point of running out of GPU memory? How would other constraints (e.g. GPU bandwidth limits) be handled? What is the ideal distributed runtime environment to make the most of this? Could spot GPU instances be used (serverless accelerators)?

@tomwhite (Member)

Thanks @alxmrs for opening this issue. I'm not familiar enough with JAX or GPUs to answer these questions, but I'd be happy to support or discuss an initiative in this direction. Is there a small piece of work that you have in mind that could be used to explore this?

@alxmrs (Contributor, Author) commented Sep 19, 2023

The JAX docs provide a few good toy examples that could be useful for validating this idea.

Check out this tutorial: https://jax.readthedocs.io/en/latest/jax-101/06-parallelism.html

This tutorial on distributing computation on a pool of TPUs (specifically, the neural net section) may be of interest, too:

https://jax.readthedocs.io/en/latest/notebooks/Distributed_arrays_and_automatic_parallelization.html#examples-neural-networks

A high-level goal for JAX + Cubed could be to make managing GPU memory effortless.

@tomwhite (Member)

Thanks for the pointers @alxmrs.

Thinking about how Cubed might hook into this, the main idea in Cubed is that every array is backed by Zarr, and an operation maps one Zarr array to another, by working in parallel on chunks (see https://tom-e-white.com/cubed/cubed.slides.html#/1).

Every operation is ultimately expressible as a blockwise operation (or a rechunk operation, but let's ignore that here), which ends up in the apply_blockwise function:

https://github.com/tomwhite/cubed/blob/0d13e4f2b12c22d1c41b9f4ea693266b21d808d0/cubed/primitive/blockwise.py#L53-L75

The key parts are:

  1. Reading each arg from Zarr into (CPU) memory (line 67),
  2. Invoking the function on the args (line 70), and
  3. Writing the result from (CPU) memory back to Zarr (line 73 or 75)

To change this to use JAX, we'd have to 1. read from Zarr into JAX arrays, 2. invoke the relevant JAX function on the arrays, 3. write the resulting JAX array to Zarr.
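As a very rough, untested sketch of those three changes (read_chunk and out_selection are made-up helper names for illustration, not Cubed's actual internals; config stands for the BlockwiseSpec-like object that apply_blockwise receives):

import numpy as np
import jax.numpy as jnp

def apply_blockwise_jax(out_key, *, config):
    # 1. read each input chunk from Zarr and place it on the device as a JAX array
    in_keys = config.key_function(out_key)
    args = [jnp.asarray(read_chunk(config.reads_map, k)) for k in in_keys]  # read_chunk is hypothetical

    # 2. invoke the (JAX) function on the chunks
    result = config.function(*args)

    # 3. copy the result back to host memory and write it to Zarr
    config.write.array[out_selection(out_key)] = np.asarray(result)  # out_selection is hypothetical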

In fact, there might not be anything to do for 2., since you could call cubed.map_blocks with a JAX function.

This might be enough for the FFT example, although I'm a bit hazy on whether any post-processing is needed on the chunked (sharded) output.
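For example, a minimal, untested sketch of step 2 on its own (chunk sizes, memory limit and dtype are just illustrative guesses):

import cubed
import cubed.array_api as xp
import jax.numpy as jnp

spec = cubed.Spec(allowed_mem="2GB")
x = xp.ones((4096, 4096), chunks=(1024, 4096), spec=spec)

def fft_block(block):
    # runs on one in-memory chunk; JAX dispatches it to the local accelerator
    return jnp.fft.fft(block, axis=-1)

# whether the returned JAX array writes back to Zarr cleanly is the step 3 question above
y = cubed.map_blocks(fft_block, x, dtype="complex64")
y.compute()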

A final thought. Is KvikIO, which does direct Zarr-GPU IO, related to this?

@TomNicholas (Member)

> read from Zarr into JAX arrays

> A final thought. Is KvikIO, which does direct Zarr-GPU IO, related to this?

Reading the xarray blog post on this that @dcherian and @weiji14 wrote, it seems they used a Zarr store provided by kvikio. I expect Cubed could use this to load data from Zarr directly to the GPU in the form of a cupy array, which would be cool. (Or you could probably even use the xarray backend they wrote alongside cubed-xarray to achieve this.)
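For reference, the route described in the blog post looks roughly like this (untested here; the path is made up, and the meta_array argument needs a reasonably recent zarr-python):

import cupy
import zarr
import kvikio.zarr

# GDSStore reads chunk bytes via GPUDirect Storage
store = kvikio.zarr.GDSStore("/path/to/data.zarr")

# meta_array tells zarr to materialise chunks as cupy arrays instead of NumPy arrays
z = zarr.open_array(store, mode="r", meta_array=cupy.empty(()))
chunk = z[:1024, :1024]  # a cupy.ndarray living in GPU memory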

I tried to find out whether there was anything similar for JAX, but didn't see anything (only this jax-ml/jax#17534). Writing from JAX to tensorstore was done for checkpointing language models (https://blog.research.google/2022/09/tensorstore-for-high-performance.html?m=1), but one would have thought that making tensorstore return JAX arrays directly would have been tried...

@weiji14 commented Feb 11, 2024

KvikIO loads data into cupy, but it should technically be possible to zero-copy cupy arrays to JAX, PyTorch, or any array library that implements conversion via DLPack or the __cuda_array_interface__ protocol. It looks like JAX supports this already (jax-ml/jax#1100)? But I haven't tried this end to end yet. There's also NVIDIA DALI, which seems to work with JAX (https://docs.nvidia.com/deeplearning/dali/archives/dali_1_32_0/user-guide/docs/plugins/jax_tutorials.html#jax), but the interface is a little less convenient since you need to set up a pipeline. Generally, the integration between the RAPIDS AI libraries (which build on cupy) is a bit better on the PyTorch side, with the RAPIDS Memory Manager (https://github.com/rapidsai/rmm/blob/branch-24.04/README.md#using-rmm-with-third-party-libraries).
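A small untested sketch of that zero-copy hand-off (newer JAX versions accept anything implementing __dlpack__ directly; older ones need the explicit capsule):

import cupy as cp
import jax
import jax.numpy as jnp

cp_arr = cp.arange(1024, dtype=cp.float32)

# cupy arrays implement __dlpack__, so this should not copy the underlying GPU buffer
jax_arr = jax.dlpack.from_dlpack(cp_arr)
# on older JAX: jax_arr = jax.dlpack.from_dlpack(cp_arr.toDlpack())

print(jax_arr.shape, jnp.sum(jax_arr))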

@alxmrs (Contributor, Author) commented Jul 23, 2024

Hey @tomwhite, I have a question for you: to run jax arrays on accelerators (M1+ chips, GPUs, TPUs, etc.), someone needs to call jax.jit: https://jax.readthedocs.io/en/latest/quickstart.html#just-in-time-compilation-with-jax-jit

Where is a good place to make this kind of call within Cubed? Is this something that should be handled by an Executor (this seems not so ideal)?

(Here's some more in-depth documentation on JAX's jit: https://jax.readthedocs.io/en/latest/jit-compilation.html#jit-compilation.)

A related concern that I haven't properly figured out yet: How should this intersect with devices and sharding?
https://jax.readthedocs.io/en/latest/sharded-computation.html

@tomwhite (Member)

> Where is a good place to make this kind of call within Cubed? Is this something that should be handled by an Executor (this seems not so ideal)?

Possibly as part of DAG finalization, after (Cubed) optimization has been run, although the function you want to jit will be the function in BlockwiseSpec:

class BlockwiseSpec:
    """Specification for how to run blockwise on an array.

    This is similar to ``CopySpec`` in rechunker.

    Attributes
    ----------
    key_function : Callable
        A function that maps an output chunk key to one or more input chunk keys.
    function : Callable
        A function that maps input chunks to an output chunk.
    function_nargs : int
        The number of array arguments that ``function`` takes.
    num_input_blocks : Tuple[int, ...]
        The number of input blocks read from each input array.
    reads_map : Dict[str, CubedArrayProxy]
        Read proxy dictionary keyed by array name.
    write : CubedArrayProxy
        Write proxy with an ``array`` attribute that supports ``__setitem__``.
    """

    key_function: Callable[..., Any]
    function: Callable[..., Any]
    function_nargs: int
    num_input_blocks: Tuple[int, ...]
    reads_map: Dict[str, CubedArrayProxy]
    write: CubedArrayProxy
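A very rough sketch of the "jit at DAG finalization" idea (the attribute names used to reach the BlockwiseSpec are guesses rather than Cubed's actual internals, and this assumes the spec is mutable or can be rebuilt):

import jax

def jit_blockwise_functions(dag):
    # walk the optimized plan and wrap each blockwise chunk function in jax.jit,
    # so the per-chunk computation is compiled for the local accelerator
    for _, data in dag.nodes(data=True):  # the plan is a networkx DiGraph
        spec = getattr(data.get("pipeline"), "config", None)  # attribute names are guesses
        if isinstance(spec, BlockwiseSpec):
            spec.function = jax.jit(spec.function)
    return dag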

What's the simplest possible example to start with?

@alxmrs (Contributor, Author) commented Jul 23, 2024

Thanks for your suggestion, Tom. I've prototyped something here: alxmrs#1

For now, it looks like I need to work on landing the M1 PR before I can take this any further.

@tomwhite (Member)

Nice!

@tomwhite (Member)

I think compiling the (Cubed optimized) blockwise functions using AOT compilation (as you mentioned in #490 (comment)), and then exporting them so they can run in other processes (https://jax.readthedocs.io/en/latest/export/export.html) may be the way to go. Perhaps this is worth trying on CPU first.
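For reference, an untested sketch of that export workflow for a single blockwise function (the chunk shape and dtype are made up; jax.export is available in recent JAX releases):

import jax
import jax.numpy as jnp
from jax import export

def block_fn(x):
    return jnp.fft.fft(x, axis=-1)

# AOT-trace for a fixed chunk shape/dtype, then serialize so a worker process
# can rehydrate and call it without re-tracing
exported = export.export(jax.jit(block_fn))(jax.ShapeDtypeStruct((1024, 4096), jnp.float32))
blob = exported.serialize()

# ... in the worker process:
rehydrated = export.deserialize(blob)
y = rehydrated.call(jnp.ones((1024, 4096), jnp.float32))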
