
Interpretation of CUPYNUMERIC_MIN_GPU_CHUNK does not match documentation #1177

Open
elliottslaughter opened this issue Mar 22, 2025 · 1 comment


The documentation of CUPYNUMERIC_MIN_GPU_CHUNK (and friends) implies that it is a minimum size, i.e., arrays of at least this size will go through Legate. Note also that there are three variables, one each for GPU, CPU and OpenMP, and the documentation implies that each variable is used for its respective kind.

min_gpu_chunk: EnvOnlySetting[int] = EnvOnlySetting(
    "min_gpu_chunk",
    "CUPYNUMERIC_MIN_GPU_CHUNK",
    default=65536,  # 1 << 16
    test_default=2,
    convert=convert_int,
    help="""
    Legate will fall back to vanilla NumPy when handling arrays smaller
    than this, rather than attempt to accelerate using GPUs, as the
    offloading overhead would likely not be offset by the accelerated
    operation code.
    This is a read-only environment variable setting used by the runtime.
    """,
)
min_cpu_chunk: EnvOnlySetting[int] = EnvOnlySetting(
    "min_cpu_chunk",
    "CUPYNUMERIC_MIN_CPU_CHUNK",
    default=1024,  # 1 << 10
    test_default=2,
    convert=convert_int,
    help="""
    Legate will fall back to vanilla NumPy when handling arrays smaller
    than this, rather than attempt to accelerate using native CPU code, as
    the offloading overhead would likely not be offset by the accelerated
    operation code.
    This is a read-only environment variable setting used by the runtime.
    """,
)
min_omp_chunk: EnvOnlySetting[int] = EnvOnlySetting(
    "min_omp_chunk",
    "CUPYNUMERIC_MIN_OMP_CHUNK",
    default=8192,  # 1 << 13
    test_default=2,
    convert=convert_int,
    help="""
    Legate will fall back to vanilla NumPy when handling arrays smaller
    than this, rather than attempt to accelerate using OpenMP, as the
    offloading overhead would likely not be offset by the accelerated
    operation code.
    This is a read-only environment variable setting used by the runtime.
    """,
)

(Scroll down within the snippet above for the equivalent CPU and OpenMP variables.)

There are a couple of divergences between the actual behavior of cuPyNumeric and the documentation of these variables:

First, cuPyNumeric only ever reads one variable in a given run. If the machine has any GPUs, it uses the GPU variable for everything. Same with OpenMP. The CPU value is used only as a fallback if the machine has no GPUs and no OpenMP procs.

if (machine.count(legate::mapping::TaskTarget::GPU) > 0) {
  return min_gpu_chunk;
}
if (machine.count(legate::mapping::TaskTarget::OMP) > 0) {
  return min_omp_chunk;
}
return min_cpu_chunk;

This means that a user who sets, say, CUPYNUMERIC_MIN_CPU_CHUNK in isolation to control the minimum CPU chunk will have their setting ignored if they run on a machine that happens to have a GPU.
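To make the effect concrete, here is a minimal Python sketch of the selection logic above. The function name and parameters are illustrative, not the actual cuPyNumeric API; the defaults are the ones from the settings listed earlier.

# Illustrative reimplementation of the C++ selection logic above; the names and
# defaults come from this issue, not from the actual cuPyNumeric source.
def effective_min_chunk(
    num_gpus: int,
    num_omp_procs: int,
    min_gpu_chunk: int = 65536,
    min_omp_chunk: int = 8192,
    min_cpu_chunk: int = 1024,
) -> int:
    if num_gpus > 0:
        return min_gpu_chunk
    if num_omp_procs > 0:
        return min_omp_chunk
    return min_cpu_chunk

# A user who sets only CUPYNUMERIC_MIN_CPU_CHUNK=1000000 on a machine with one
# GPU still gets the GPU default; the CPU setting is never consulted.
assert effective_min_chunk(num_gpus=1, num_omp_procs=0,
                           min_cpu_chunk=1_000_000) == 65536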

Second, the variable is documented as a minimum, but in practice it is used as a threshold. Suppose all three variables are set to N. The way the documentation reads, one would expect an array of size N to be parallelized. Instead, only arrays of size N+1 are parallelized. You can see that the comparison in is_eager_shape is done with <=, but eager is the negation of parallel, so to match the public definition this would need to be <:

# Otherwise, see if the volume is large enough
return volume <= self.max_eager_volume

This is problematic if the user has set the variables to the specific size N that they know their data will be, since it causes their code not to be parallelized even though it meets the documented minimum size requirement.
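To spell out the boundary behavior, here is a small sketch assuming max_eager_volume is set directly from the MIN_*_CHUNK value N, as described above; the function names are illustrative.

# Illustration of the off-by-one described above; N stands in for the value of
# the relevant CUPYNUMERIC_MIN_*_CHUNK variable.
N = 65536

def is_eager(volume: int, max_eager_volume: int = N) -> bool:
    # Current behavior: the comparison is <=, so an array of exactly N elements
    # is still handled eagerly by vanilla NumPy.
    return volume <= max_eager_volume

assert is_eager(N)          # size N falls back to NumPy (not parallelized)
assert not is_eager(N + 1)  # only sizes N+1 and larger go through Legate

def is_eager_documented(volume: int, min_chunk: int = N) -> bool:
    # What the documented "minimum size" semantics would imply: an array of
    # size N ("at least this size") would be parallelized.
    return volume < min_chunk

assert not is_eager_documented(N)  # size N would be parallelized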


elliottslaughter commented Mar 24, 2025

I've been thinking about this some more. Based on my current understanding, not only do the variables not match the documentation, it is actually not possible to implement them in a way that matches their documentation/naming. Moreover, it is confusing because the equivalent LEGATE_MIN_*_CHUNK variables actually do match their documentation, so someone coming from Legate would be even more confused.

Therefore, I propose that all of the current CUPYNUMERIC_MIN_*_CHUNK variables be removed and replaced with something different.

I suggest the new variable should be called CUPYNUMERIC_MAX_EAGER_VOLUME to match how the code is written, and because that is really the only way the code can be written. The eager optimization applies before cuPyNumeric has any interaction with Legate at all. Therefore, this isn't really a chunk size, nor does it have anything to do with parallelism per se. CUPYNUMERIC_MIN_CHUNK (without the processor kind) doesn't really make sense, as this is not the size of a chunk that will be assigned to a processor. This is the size below which you don't bother doing any parallel analysis at all and just keep everything local.
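If something along these lines were adopted, the new setting could be modeled on the existing EnvOnlySetting definitions quoted above. The following is only a hypothetical sketch; the default value and help wording are placeholders, not a concrete patch.

max_eager_volume: EnvOnlySetting[int] = EnvOnlySetting(
    "max_eager_volume",
    "CUPYNUMERIC_MAX_EAGER_VOLUME",
    default=8192,  # placeholder default, to be decided
    test_default=2,
    convert=convert_int,
    help="""
    cuPyNumeric will handle arrays with at most this many elements eagerly,
    using vanilla NumPy, without involving Legate at all; larger arrays are
    deferred to Legate regardless of processor kind.
    This is a read-only environment variable setting used by the runtime.
    """,
)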

It also doesn't make sense to tie the cuPyNumeric variable to a specific processor kind, because even if Legate has GPUs somewhere in the machine model, that doesn't mean Legate must use GPUs for all operations. An array that is too small to execute on GPUs might still make sense to parallelize over CPUs, but the current cuPyNumeric implementation of these variables lifts the minimum when any GPUs exist at all, causing those operations never to be exposed to Legate in the first place.

It's also important to differentiate this variable from the LEGATE_MIN_*_CHUNK variables because it really serves a different purpose. It might make sense to run an operation on 1 GPU, but not to distribute it over multiple GPUs. Such an operation must still be exposed to Legate. Therefore, it makes sense to tune the cuPyNumeric variable differently from the Legate one, because you might want to expose arrays to Legate that you wouldn't necessarily distribute.
