The documentation of CUPYNUMERIC_MIN_GPU_CHUNK (and friends) implies that it is a minimum size, i.e., arrays of at least this size will go through Legate. Note also that there are three variables, one each for GPU, CPU and OpenMP, and the documentation implies that each variable is used for its respective kind.
From cupynumeric/cupynumeric/settings.py, Lines 106 to 152 in 06244e4 (excerpt):

```python
Legate will fall back to vanilla NumPy when handling arrays smaller
than this, rather than attempt to accelerate using OpenMP, as the
offloading overhead would likely not be offset by the accelerated
operation code.

This is a read-only environment variable setting used by the runtime.
""",
)
```
(Scroll down in the embedded snippet for the equivalent variables for CPU and OpenMP.)
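To make the documented semantics concrete, here is a hypothetical usage sketch (the array size, and the idea of setting the variable from inside the script rather than in the launching environment, are illustrative assumptions only):

```python
import os

# Documented reading: arrays of at least this many elements go through Legate.
os.environ["CUPYNUMERIC_MIN_GPU_CHUNK"] = "1000000"

import cupynumeric as np

a = np.ones((1000, 1000))  # exactly 1_000_000 elements, i.e. "at least this size",
b = a + a                  # so one would expect this operation to be parallelized
```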
There are a couple of divergences between the actual behavior of cuPyNumeric and the documentation of these variables:
First, cuPyNumeric only ever reads one variable in a given run. If the machine has any GPUs, it uses the GPU variable for everything. Same with OpenMP. The CPU value is used only as a fallback if the machine has no GPUs and no OpenMP procs.
From cupynumeric/src/cupynumeric/runtime.cc, Lines 176 to 182 in 06244e4:

```cpp
if (machine.count(legate::mapping::TaskTarget::GPU) > 0) {
  return min_gpu_chunk;
}
if (machine.count(legate::mapping::TaskTarget::OMP) > 0) {
  return min_omp_chunk;
}
return min_cpu_chunk;
```
This means that a user who sets, say, CUPYNUMERIC_MIN_CPU_CHUNK in isolation, expecting to control the minimum CPU chunk, will have their setting ignored if they run on a machine that happens to have a GPU.
Second, the variable is documented as a minimum, but in practice it is used as a strict threshold. Suppose all three variables are set to N. The way the documentation reads, one would expect an array of size N to be parallelized. Instead, only arrays of size N+1 or larger are parallelized. You can see that the comparison in is_eager_shape is done with <=, but eager is the negation of parallel, so to match the public definition this would need to be < (see cupynumeric/cupynumeric/runtime.py, Lines 543 to 544 in 06244e4).
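A paraphrased sketch of that check (the exact signature and surrounding code differ; min_shard_volume stands for whichever of the three settings was selected above):

```python
import math

def is_eager_shape(shape: tuple[int, ...], min_shard_volume: int) -> bool:
    volume = math.prod(shape)
    # Current behavior: a volume equal to the threshold is still handled
    # eagerly, so only strictly larger arrays are parallelized.
    return volume <= min_shard_volume
    # To match the documented "minimum size" semantics, this would need to be:
    # return volume < min_shard_volume
```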
This is problematic if the user has set the variables to the exact size N that they know their data will have: their code is then not parallelized even though it meets the documented minimum size requirement.
I've been thinking about this some more. Based on my current understanding, not only do the variables not match their documentation as currently implemented, but it is actually not possible to implement them in a way that matches their documentation/naming. Moreover, the equivalent LEGATE_MIN_*_CHUNK variables actually do match their documentation, so someone coming from Legate would be even more confused.
Therefore, I propose that all current CUPYNUMERIC_MIN_*_CHUNK variables be removed and replaced with something different.
I suggest the new variable should be called CUPYNUMERIC_MAX_EAGER_VOLUME to match how the code is written, and because that is really the only way the code can be written. The eager optimization applies before cuPyNumeric has any interaction with Legate at all. Therefore, this isn't really a chunk size, nor does it have anything to do with parallelism per se. CUPYNUMERIC_MIN_CHUNK (without the processor kind) doesn't really make sense, as this is not the size of a chunk that will be assigned to a processor. This is the size below which you don't bother doing any parallel analysis at all and just keep everything local.
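A minimal sketch of the proposed semantics, assuming the new name (CUPYNUMERIC_MAX_EAGER_VOLUME does not exist today, and the default value below is a placeholder):

```python
import math
import os

# One processor-agnostic threshold, read once, with no machine-model lookup.
max_eager_volume = int(os.environ.get("CUPYNUMERIC_MAX_EAGER_VOLUME", "8192"))

def is_eager(shape: tuple[int, ...]) -> bool:
    # The <= comparison now matches the name: volumes up to and including the
    # maximum eager volume stay as local NumPy; anything larger goes to Legate.
    return math.prod(shape) <= max_eager_volume
```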
It also doesn't make sense to tie the cuPyNumeric variable to a specific processor kind, because even if Legate has GPUs somewhere in the machine model, that doesn't mean Legate must use GPUs for every operation. An array that is too small to execute on GPUs might still make sense to parallelize over CPUs, but the current cuPyNumeric implementation of these variables raises the minimum whenever any GPUs exist at all, causing such operations to never be exposed to Legate in the first place.
It's also important to differentiate this variable from the LEGATE_MIN_*_CHUNK variables because it serves a different purpose. It might make sense to run an operation on 1 GPU but not to distribute it over multiple GPUs, and such an operation must still be exposed to Legate. Therefore it makes sense to tune the cuPyNumeric variable differently from the Legate one: you might want to expose arrays to Legate that you wouldn't necessarily distribute (see the sketch below).
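A conceptual sketch of the two separate decisions described above (not real cuPyNumeric or Legate code; both threshold values and the per-GPU split heuristic are illustrative assumptions):

```python
MAX_EAGER_VOLUME = 10_000         # proposed cuPyNumeric knob: eager NumPy vs. Legate
LEGATE_MIN_GPU_CHUNK = 1_000_000  # Legate-side knob: when to split across GPUs

def exposed_to_legate(volume: int) -> bool:
    # Decision 1 (cuPyNumeric): is the array big enough to bother handing to
    # Legate at all, even if it only ever runs on a single processor?
    return volume > MAX_EAGER_VOLUME

def split_across_gpus(volume: int, num_gpus: int) -> bool:
    # Decision 2 (Legate): is each piece big enough to justify distributing?
    return exposed_to_legate(volume) and volume // num_gpus >= LEGATE_MIN_GPU_CHUNK

# A 100_000-element array is worth sending to Legate (it may run on one GPU)
# but not worth splitting across two GPUs, so the two knobs need different values.
print(exposed_to_legate(100_000), split_across_gpus(100_000, 2))
```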