
[FEA]: Redesign default tuning #3570

Open
gevtushenko opened this issue Jan 28, 2025 · 3 comments
Labels

  • cub: For all items related to CUB
  • feature request: New feature or request
  • thrust: For all items related to Thrust

Comments

@gevtushenko
Collaborator

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

Some of the parameters in parallel algorithms affect performance characteristics but not functional correctness.
Thread block size, grain size (number of items per thread), cache modifiers, etc. are all examples of such parameters.
A set of these parameters composes a tuning:

template <class T> struct sm90_tuning<T, primitive_op::yes, primitive_accum::yes, accum_size::_1> : sm90_tuning_vals<T, 192, 22, 168, 1140> {};
template <class T> struct sm90_tuning<T, primitive_op::yes, primitive_accum::yes, accum_size::_2> : sm90_tuning_vals<T, 512, 12, 376, 1125> {};
template <class T> struct sm90_tuning<T, primitive_op::yes, primitive_accum::yes, accum_size::_4> : sm90_tuning_vals<T, 128, 24, 648, 1245> {};
template <class T> struct sm90_tuning<T, primitive_op::yes, primitive_accum::yes, accum_size::_8> : sm90_tuning_vals<T, 224, 24, 632, 1290> {};

Today, we are tuning parallel algorithms for specific compile-time workloads.
For example, a prefix sum of int64_t with cuda::std::plus will be tuned differently than a prefix sum of uint8_t with cuda::minimum.

This approach leads to CUB not applying any of the tunings when, say, the binary operator or the value types are not known.
This leads to suboptimal performance in cases that represent a slight deviation from trivial binary operators that we know about.
Initially, this approach was chosen as a safe default.
We can't introspect the incoming binary operator.
If this binary operator is complex enough, or leads to significant load imbalance, increased block and grain sizes that suit trivial cases can lead to performance regressions.
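
To make the failure mode concrete, here is a hypothetical user-defined operator (illustrative only, not CUB code): it is functionally equivalent to cuda::std::plus<int>, but CUB cannot see that, so the primitive_op::yes specializations above do not apply and the untuned fallback policy is selected.

struct user_plus // hypothetical example
{
  // Functionally the same as cuda::std::plus<int>, but opaque to CUB's
  // operator detection, so none of the sm90_tuning specializations match.
  __host__ __device__ int operator()(int a, int b) const
  {
    return a + b;
  }
};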

Our current intuition is that arithmetically intensive operators are rare compared to trivial ones. For instance:

  1. state, ::cuda::std::tuple{b.begin(), c.begin()}, a.begin(), n,
     [=] _CCCL_DEVICE(const T& bi, const T& ci) {
       return bi + scalar * ci;
     });
  2. [=] _CCCL_DEVICE(const T& ai, const T& bi, const T& ci) {
       return ai + bi + scalar * ci;
     },

In the cases above, we'd prefer to have some tuning instead of pessimizing performance.
This issue is meant as a discussion / design point on our options for avoiding this pessimization.

Describe the solution you'd like

We should revisit our approach on tuning unknown workloads.
There are a few options that we have considered.

Opt-In Scheme

We could give users an opt-in mechanism to let CUB know that a given value type / operator combination is trivial:

cub::DeviceReduce(..., 
  proclaim_trivial_case(
    []__host__ __device__(char a, char b) -> char { return a + b; }
  ));

Then, on the tuning end, we'd relax tuning from:

template <class T> 
struct sm90_tuning<T, primitive_op::yes, primitive_accum::yes, accum_size::_1> : sm90_tuning_vals<T, 192, 22, 168, 1140> {};

to something like:

template <class T> 
struct sm90_tuning<T, accum_size::_1> : sm90_tuning_vals<T, 192, 22, 168, 1140> {};
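
For concreteness, a minimal sketch of what a proclaim_trivial_case wrapper could look like; trivial_op_wrapper and is_proclaimed_trivial are illustrative names and not existing CUB API (similar in spirit to how cuda::proclaim_return_type wraps a callable):

#include <cuda/std/type_traits>
#include <cuda/std/utility>

// Hypothetical wrapper that carries the user's operator and tags it as trivial.
template <class Op>
struct trivial_op_wrapper
{
  Op op;

  template <class... Ts>
  __host__ __device__ auto operator()(Ts&&... args) const
    -> decltype(op(::cuda::std::forward<Ts>(args)...))
  {
    return op(::cuda::std::forward<Ts>(args)...);
  }
};

// Factory mirroring the usage shown above.
template <class Op>
__host__ __device__ trivial_op_wrapper<Op> proclaim_trivial_case(Op op)
{
  return trivial_op_wrapper<Op>{op};
}

// Trait the tuning machinery could inspect when selecting a policy.
template <class Op>
struct is_proclaimed_trivial : ::cuda::std::false_type
{};

template <class Op>
struct is_proclaimed_trivial<trivial_op_wrapper<Op>> : ::cuda::std::true_type
{};

The tuning layer would then key on is_proclaimed_trivial<Op> rather than on the exact operator type.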

Pros

  • User gains access to the tunings we have for trivial cases
  • Heavy-weight operations do not experience performance regressions
  • Annotation can propagate through the stack: a Thrust user can annotate a function and the annotation would travel all the way down to CUB.

Cons

  • Manual annotation means that some users will leave this option on the table and never benefit from the tunings

Opt-Out Scheme

Alternatively, we could apply existing tunings by default and give users a way to opt out of these tunings:

cub::DeviceReduce(..., 
  proclaim_heavy_weight_case(
    []__host__ __device__(cuda::std::complex<float> a, cuda::std::complex<float> b) -> cuda::std::complex<float> {
      return heavy_weight_comparison(a, b) ? a : b; }
  ));
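
On the tuning side, a sketch of how such an opt-out could be consumed; the trait and alias names below are assumptions for illustration, not existing CUB code:

#include <cuda/std/type_traits>

// Hypothetical opt-out trait: defaults to false, so the tuned policies apply
// unless the user wrapped their operator with proclaim_heavy_weight_case.
template <class Op>
struct is_proclaimed_heavy_weight : ::cuda::std::false_type
{};

// Hypothetical policy selection: prefer the tuned policy and fall back to a
// conservative one (smaller block / grain sizes, no unrolling) only when the
// operator was proclaimed heavy-weight.
template <class Op, class TunedPolicy, class FallbackPolicy>
using select_policy_t =
  ::cuda::std::conditional_t<is_proclaimed_heavy_weight<Op>::value, FallbackPolicy, TunedPolicy>;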

Pros

  • Common use cases will likely experience performance improvements
  • We'll be able to recognize heavy-weight workloads, and, say, remove #pragma unroll from CUB kernels
  • We'll be able to go beyond tuning, and select completely different (work-efficient) implementations of parallel algorithms for this case
  • Annotation can propagate through the stack: a Thrust user can annotate a function and the annotation would travel all the way down to CUB.

Cons

  • Users providing heavy-weight operations will likely experience perf regressions and will have to manually annotate their code

I lean towards the opt-out scheme.

Before closing this issue, we should allow some time to gather more opinions.
After that, the issue can be closed with a design for a function wrapper, along with an example of the opt-in / opt-out (depending on what we choose) behavior in one of the CUB algorithms.

Describe alternatives you've considered

No response

Additional context

No response

@gevtushenko gevtushenko added the feature request New feature or request. label Jan 28, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL Jan 28, 2025
@gevtushenko gevtushenko added thrust For all items related to Thrust. cub For all items related to CUB labels Jan 28, 2025
@gevtushenko
Collaborator Author

@elstehle, @bernhardmgruber, @shwina, @jrhemstad, @oleksandr-pavlyk, @griwes I'd be interested to hear your opinions, especially if you don't think the opt-out scheme is a good idea.

@gevtushenko
Collaborator Author

Related issue: #754

@ahendriksen
Contributor

The new thrust::transform does not rely on unrolling for performance and can auto-tune based on occupancy. It is not a solution that can easily be applied to all other algorithms, but I wanted you to be aware of it.
