The documentation of CUPYNUMERIC_MIN_GPU_CHUNK (and friends) implies that it is a minimum size, i.e., arrays of at least this size will go through Legate. Note also that there are three variables, one each for GPU, CPU and OpenMP, and the documentation implies that each variable is used for its respective kind.
From cupynumeric/cupynumeric/settings.py, Lines 106 to 152 in 06244e4 (excerpt):

```python
Legate will fall back to vanilla NumPy when handling arrays smaller
than this, rather than attempt to accelerate using OpenMP, as the
offloading overhead would likely not be offset by the accelerated
operation code.

This is a read-only environment variable setting used by the runtime.
""",
)
```
(Scroll down in the embedded snippet for the equivalent variables for CPU and OpenMP.)
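To make the documented semantics concrete, here is a hypothetical usage sketch (the array size, and the idea of setting the variable from inside the script rather than in the launching environment, are illustrative assumptions only):

```python
import os

# Documented reading: arrays of at least this many elements go through Legate.
os.environ["CUPYNUMERIC_MIN_GPU_CHUNK"] = "1000000"

import cupynumeric as np

a = np.ones((1000, 1000))  # exactly 1_000_000 elements, i.e. "at least this size",
b = a + a                  # so one would expect this operation to be parallelized
```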
There are a couple of divergences between the actual behavior of cuPyNumeric and the documentation of these variables:
First, cuPyNumeric only ever reads one variable in a given run. If the machine has any GPUs, it uses the GPU variable for everything. Same with OpenMP. The CPU value is used only as a fallback if the machine has no GPUs and no OpenMP procs.
From cupynumeric/src/cupynumeric/runtime.cc, Lines 176 to 182 in 06244e4:

```cpp
if (machine.count(legate::mapping::TaskTarget::GPU) > 0) {
  return min_gpu_chunk;
}
if (machine.count(legate::mapping::TaskTarget::OMP) > 0) {
  return min_omp_chunk;
}
return min_cpu_chunk;
```
This means that a user who sets, say, CUPYNUMERIC_MIN_CPU_CHUNK in isolation, expecting to control the minimum CPU chunk, will have their setting ignored if they run on a machine that happens to have a GPU.
Second, the variable is documented as a minimum, but in practice it is used as a strict threshold. Suppose all three variables are set to N. The way the documentation reads, one would expect an array of size N to be parallelized. Instead, only arrays of size N+1 or larger are parallelized. You can see that the comparison in is_eager_shape is done with <=, but eager is the negation of parallel, so to match the public definition this would need to be < (see cupynumeric/cupynumeric/runtime.py, Lines 543 to 544 in 06244e4).
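A paraphrased sketch of that check (the exact signature and surrounding code differ; min_shard_volume stands for whichever of the three settings was selected above):

```python
import math

def is_eager_shape(shape: tuple[int, ...], min_shard_volume: int) -> bool:
    volume = math.prod(shape)
    # Current behavior: a volume equal to the threshold is still handled
    # eagerly, so only strictly larger arrays are parallelized.
    return volume <= min_shard_volume
    # To match the documented "minimum size" semantics, this would need to be:
    # return volume < min_shard_volume
```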
This is problematic if the user has set the variables to the exact size N that they know their data will have: their code is then not parallelized even though it meets the documented minimum size requirement.
I've been thinking about this some more. Based on my current understanding, not only do the variables not match their documentation as currently implemented, but it is actually not possible to implement them in a way that matches their documentation/naming. Moreover, the equivalent LEGATE_MIN_*_CHUNK variables actually do match their documentation, so someone coming from Legate would be even more confused.
Therefore, I propose that all current CUPYNUMERIC_MIN_*_CHUNK variables be removed and replaced with something different.
I suggest the new variable should be called CUPYNUMERIC_MAX_EAGER_VOLUME to match how the code is written, and because that is really the only way the code can be written. The eager optimization applies before cuPyNumeric has any interaction with Legate at all. Therefore, this isn't really a chunk size, nor does it have anything to do with parallelism per se. CUPYNUMERIC_MIN_CHUNK (without the processor kind) doesn't really make sense, as this is not the size of a chunk that will be assigned to a processor. This is the size below which you don't bother doing any parallel analysis at all and just keep everything local.
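A minimal sketch of the proposed semantics, assuming the new name (CUPYNUMERIC_MAX_EAGER_VOLUME does not exist today, and the default value below is a placeholder):

```python
import math
import os

# One processor-agnostic threshold, read once, with no machine-model lookup.
max_eager_volume = int(os.environ.get("CUPYNUMERIC_MAX_EAGER_VOLUME", "8192"))

def is_eager(shape: tuple[int, ...]) -> bool:
    # The <= comparison now matches the name: volumes up to and including the
    # maximum eager volume stay as local NumPy; anything larger goes to Legate.
    return math.prod(shape) <= max_eager_volume
```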
It also doesn't make sense to tie the cuPyNumeric variable to a specific processor kind, because even if Legate has GPUs somewhere in the machine model, that doesn't mean Legate must use GPUs for every operation. An array that is too small to execute on GPUs might still make sense to parallelize over CPUs, but the current cuPyNumeric implementation of these variables raises the minimum whenever any GPUs exist at all, causing such operations to never be exposed to Legate in the first place.
It's also important to differentiate this variable from the LEGATE_MIN_*_CHUNK variables because it serves a different purpose. It might make sense to run an operation on 1 GPU but not to distribute it over multiple GPUs, and such an operation must still be exposed to Legate. Therefore it makes sense to tune the cuPyNumeric variable differently from the Legate one: you might want to expose arrays to Legate that you wouldn't necessarily distribute (see the sketch below).
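A conceptual sketch of the two separate decisions described above (not real cuPyNumeric or Legate code; both threshold values and the per-GPU split heuristic are illustrative assumptions):

```python
MAX_EAGER_VOLUME = 10_000         # proposed cuPyNumeric knob: eager NumPy vs. Legate
LEGATE_MIN_GPU_CHUNK = 1_000_000  # Legate-side knob: when to split across GPUs

def exposed_to_legate(volume: int) -> bool:
    # Decision 1 (cuPyNumeric): is the array big enough to bother handing to
    # Legate at all, even if it only ever runs on a single processor?
    return volume > MAX_EAGER_VOLUME

def split_across_gpus(volume: int, num_gpus: int) -> bool:
    # Decision 2 (Legate): is each piece big enough to justify distributing?
    return exposed_to_legate(volume) and volume // num_gpus >= LEGATE_MIN_GPU_CHUNK

# A 100_000-element array is worth sending to Legate (it may run on one GPU)
# but not worth splitting across two GPUs, so the two knobs need different values.
print(exposed_to_legate(100_000), split_across_gpus(100_000, 2))
```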