[FEA]: Redesign default tuning #3570
Labels: cub, feature request, thrust
Area
CUB
Is your feature request related to a problem? Please describe.
Some of the parameters in parallel algorithms affect performance characteristics but not functional correctness.
Thread block size, grain size (the number of items per thread), cache modifiers, and so on are all examples of such parameters.
A set of these parameters composes a tuning:
See cccl/cub/cub/device/dispatch/tuning/tuning_scan.cuh, lines 216 to 219 at 28974d0.
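For illustration, such a tuning is roughly a struct of compile-time constants. The sketch below is simplified; the member names are illustrative, and the real structs in tuning_scan.cuh differ in detail:

```cpp
#include <cub/thread/thread_load.cuh>

// Illustrative only: a tuning bundles the performance-only knobs for one
// combination of architecture, value type, and operator.
struct example_tuning
{
  static constexpr int threads = 128; // thread block size
  static constexpr int items   = 15;  // grain size: items per thread
  static constexpr cub::CacheLoadModifier load_modifier =
    cub::LOAD_CA;                     // cache modifier for global loads
};
```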
Today, we tune parallel algorithms for specific compile-time workloads. For example, a prefix sum of `int64_t` with `cuda::std::plus` is tuned differently from a prefix sum of `uint8_t` with `cuda::std::minimum`. As a result, CUB applies none of the tunings when, say, the binary operator or the value types are not known. This leads to suboptimal performance in cases that deviate only slightly from the trivial binary operators we know about.
Initially, this approach was chosen as a safe default.
We can't introspect the incoming binary operator.
If this binary operator is complex enough, or leads to significant load imbalance, the increased block and grain sizes that suit trivial cases can cause performance regressions.
Our current intuition is that arithmetically intense operators are rare compared to trivial ones. For instance:
cccl/cub/benchmarks/bench/transform/babelstream2.cu, lines 44 to 46 at 0b5844f
cccl/cub/benchmarks/bench/transform/babelstream3.cu, lines 36 to 38 at 0b5844f
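The following is not a verbatim copy of those benchmark lines, just a sketch of their flavor (the `triad` helper and its names are made up): the operator closes over a runtime scalar, so CUB cannot match it against any known-trivial operator type at compile time.

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// A "triad"-style workload: only a hair more complex than cuda::std::plus,
// yet opaque to CUB's compile-time tuning dispatch because it is a user
// lambda. (Compile with nvcc --extended-lambda.)
template <class T>
void triad(thrust::device_vector<T>& a,
           const thrust::device_vector<T>& b,
           const thrust::device_vector<T>& c,
           T scalar)
{
  thrust::transform(b.begin(), b.end(), c.begin(), a.begin(),
                    [scalar] __device__(T bi, T ci) { return bi + scalar * ci; });
}
```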
In the cases above, we'd prefer to have some tuning instead of pessimizing performance.
This issue is meant as a discussion / design point on our options for avoiding this pessimization.
Describe the solution you'd like
We should revisit our approach on tuning unknown workloads.
There are a few options that we have considered.
Opt-In Scheme
We could give users an opt-in mechanism to let CUB know that a given value type / operator combination is trivial:
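One possible shape of such an opt-in; every name below is hypothetical, and none of it is an existing CUB API:

```cpp
#include <cuda/std/type_traits>

// Hypothetical trait: false by default; users specialize it to proclaim
// that an operator is as cheap as a built-in one for a given value type.
template <class Op, class T>
struct proclaims_trivial_operator : cuda::std::false_type
{};

// A user-defined operator that happens to be trivially cheap:
struct my_plus
{
  __host__ __device__ int operator()(int a, int b) const
  {
    return a + b;
  }
};

// Opt-in: tell CUB that my_plus over int behaves like cuda::std::plus.
template <>
struct proclaims_trivial_operator<my_plus, int> : cuda::std::true_type
{};
```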
Then, on the tuning end, we'd relax the tuning guard from:
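A paraphrased sketch of today's guard; `is_primitive_accum` and `is_primitive_op` are stand-ins for the actual predicates in tuning_scan.cuh:

```cpp
#include <cuda/std/functional>
#include <cuda/std/type_traits>

// Stand-ins for the real predicates in tuning_scan.cuh:
template <class T>
using is_primitive_accum = cuda::std::is_arithmetic<T>;

template <class Op>
struct is_primitive_op : cuda::std::false_type
{};

template <class T>
struct is_primitive_op<cuda::std::plus<T>> : cuda::std::true_type
{};

// Paraphrased guard: tunings are selected only when both the accumulator
// type and the scan operator are recognized as primitive.
template <class AccumT, class ScanOpT>
using tuning_applies = cuda::std::bool_constant<
  is_primitive_accum<AccumT>::value && is_primitive_op<ScanOpT>::value>;
```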
to something like:
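The relaxed guard could reuse the hypothetical `proclaims_trivial_operator` trait from the opt-in sketch above; all names remain illustrative:

```cpp
// Relaxed guard: user-proclaimed operators become eligible for the same
// tunings as the built-in primitive operators.
template <class AccumT, class ScanOpT>
using tuning_applies = cuda::std::bool_constant<
  is_primitive_accum<AccumT>::value
  && (is_primitive_op<ScanOpT>::value
      || proclaims_trivial_operator<ScanOpT, AccumT>::value)>;
```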
Pros
- Keeps today's conservative behavior as the default, so complex operators cannot regress.
Cons
- Unknown workloads stay untuned unless users act, so slightly non-trivial operators remain pessimized by default.
Opt-Out Scheme
Alternatively, we could apply existing tunings by default and give users a way to opt out of these tunings:
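Again a sketch with hypothetical names: a wrapper that marks an operator as expensive, telling CUB to fall back to conservative defaults instead of the aggressive tunings.

```cpp
#include <cuda/std/utility>

// Hypothetical opt-out wrapper: dispatch would detect expensive_op and
// skip the aggressive tunings for the wrapped operator.
template <class Op>
struct expensive_op
{
  Op op;

  template <class... Ts>
  __host__ __device__ auto operator()(Ts&&... args) const
  {
    return op(cuda::std::forward<Ts>(args)...);
  }
};

// Usage sketch: wrap the heavy functor at the call site, e.g.
//   cub::DeviceScan::InclusiveScan(..., expensive_op<HeavyOp>{HeavyOp{}}, ...);
```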
Pros
- Existing tunings apply out of the box, so common trivial operators are no longer pessimized.
- Would allow removing `#pragma unroll` from CUB kernels.
Cons
- Complex or load-imbalanced operators may regress until users opt out.
I'm inclined towards the opt-out scheme.
Before closing this issue, we should allow some time to gather more opinions.
After that, the issue can be closed with a design for a function wrapper, along with an example of the chosen opt-in / opt-out behavior in one of the CUB algorithms.
Describe alternatives you've considered
No response
Additional context
No response