
[FEA]: Redesign default tuning #3570

Open
gevtushenko opened this issue Jan 28, 2025 · 3 comments
Labels

  • cub: For all items related to CUB
  • feature request: New feature or request
  • thrust: For all items related to Thrust

Comments

@gevtushenko
Collaborator

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

Some of the parameters in parallel algorithms affect performance characteristics but not functional correctness.
Thread block size, grain size (number of items per thread), cache modifiers, etc. are all examples of such parameters.
A set of these parameters composes a tuning:

template <class T> struct sm90_tuning<T, primitive_op::yes, primitive_accum::yes, accum_size::_1> : sm90_tuning_vals<T, 192, 22, 168, 1140> {};
template <class T> struct sm90_tuning<T, primitive_op::yes, primitive_accum::yes, accum_size::_2> : sm90_tuning_vals<T, 512, 12, 376, 1125> {};
template <class T> struct sm90_tuning<T, primitive_op::yes, primitive_accum::yes, accum_size::_4> : sm90_tuning_vals<T, 128, 24, 648, 1245> {};
template <class T> struct sm90_tuning<T, primitive_op::yes, primitive_accum::yes, accum_size::_8> : sm90_tuning_vals<T, 224, 24, 632, 1290> {};

Today, we are tuning parallel algorithms for specific compile-time workloads.
For example, a prefix sum of int64_t with cuda::std::plus will be tuned differently than a prefix sum of uint8_t with cuda::minimum.

This approach leads to CUB not applying any of the tunings when, say, the binary operator or the value types are not known.
This leads to suboptimal performance in cases that represent a slight deviation from trivial binary operators that we know about.
Initially, this approach was chosen as a safe default.
We can't introspect the incoming binary operator.
If this binary operator is complex enough, or leads to significant load imbalance, increased block and grain sizes that suit trivial cases can lead to performance regressions.
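
To make the failure mode concrete, here is a hypothetical user-defined operator (illustrative only, not CUB code): it is functionally equivalent to cuda::std::plus<int>, but CUB cannot see that, so the primitive_op::yes specializations above do not apply and the untuned fallback policy is selected.

struct user_plus // hypothetical example
{
  // Functionally the same as cuda::std::plus<int>, but opaque to CUB's
  // operator detection, so none of the sm90_tuning specializations match.
  __host__ __device__ int operator()(int a, int b) const
  {
    return a + b;
  }
};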

Our current intuition is that arithmetically intensive operators are rare compared to trivial ones. For instance:

  1. state, ::cuda::std::tuple{b.begin(), c.begin()}, a.begin(), n,
     [=] _CCCL_DEVICE(const T& bi, const T& ci) {
       return bi + scalar * ci;
     });
  2. [=] _CCCL_DEVICE(const T& ai, const T& bi, const T& ci) {
       return ai + bi + scalar * ci;
     },

In the cases above, we'd prefer to have some tuning instead of pessimizing performance.
This issue is meant as a discussion / design point on our options for avoiding this pessimization.

Describe the solution you'd like

We should revisit our approach on tuning unknown workloads.
There are a few options that we have considered.

Opt-In Scheme

We could give users an opt-in mechanism to let CUB know that a given value type / operator combination is trivial:

cub::DeviceReduce(..., 
  proclaim_trivial_case(
    []__host__ __device__(char a, char b) -> char { return a + b; }
  ));

Then, on the tuning end, we'd relax tuning from:

template <class T> 
struct sm90_tuning<T, primitive_op::yes, primitive_accum::yes, accum_size::_1> : sm90_tuning_vals<T, 192, 22, 168, 1140> {};

to something like:

template <class T> 
struct sm90_tuning<T, accum_size::_1> : sm90_tuning_vals<T, 192, 22, 168, 1140> {};
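
For concreteness, a minimal sketch of what a proclaim_trivial_case wrapper could look like; trivial_op_wrapper and is_proclaimed_trivial are illustrative names and not existing CUB API (similar in spirit to how cuda::proclaim_return_type wraps a callable):

#include <cuda/std/type_traits>
#include <cuda/std/utility>

// Hypothetical wrapper that carries the user's operator and tags it as trivial.
template <class Op>
struct trivial_op_wrapper
{
  Op op;

  template <class... Ts>
  __host__ __device__ auto operator()(Ts&&... args) const
    -> decltype(op(::cuda::std::forward<Ts>(args)...))
  {
    return op(::cuda::std::forward<Ts>(args)...);
  }
};

// Factory mirroring the usage shown above.
template <class Op>
__host__ __device__ trivial_op_wrapper<Op> proclaim_trivial_case(Op op)
{
  return trivial_op_wrapper<Op>{op};
}

// Trait the tuning machinery could inspect when selecting a policy.
template <class Op>
struct is_proclaimed_trivial : ::cuda::std::false_type
{};

template <class Op>
struct is_proclaimed_trivial<trivial_op_wrapper<Op>> : ::cuda::std::true_type
{};

The tuning layer would then key on is_proclaimed_trivial<Op> rather than on the exact operator type.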

Pros

  • User gains access to the tunings we have for trivial cases
  • Heavy-weight operations do not experience performance regressions
  • Annotation can propagate through the stack: a Thrust user can annotate a function and the annotation would travel all the way down to CUB.

Cons

  • Manual annotation means that some users will leave this option on the table and never benefit from the tunings

Opt-Out Scheme

Alternatively, we could apply existing tunings by default and give users a way to opt out of these tunings:

cub::DeviceReduce(..., 
  proclaim_heavy_weight_case(
    []__host__ __device__(cuda::std::complex<float> a, cuda::std::complex<float> b) -> cuda::std::complex<float> {
      return heavy_weight_comparison(a, b) ? a : b; }
  ));
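
On the tuning side, a sketch of how such an opt-out could be consumed; the trait and alias names below are assumptions for illustration, not existing CUB code:

#include <cuda/std/type_traits>

// Hypothetical opt-out trait: defaults to false, so the tuned policies apply
// unless the user wrapped their operator with proclaim_heavy_weight_case.
template <class Op>
struct is_proclaimed_heavy_weight : ::cuda::std::false_type
{};

// Hypothetical policy selection: prefer the tuned policy and fall back to a
// conservative one (smaller block / grain sizes, no unrolling) only when the
// operator was proclaimed heavy-weight.
template <class Op, class TunedPolicy, class FallbackPolicy>
using select_policy_t =
  ::cuda::std::conditional_t<is_proclaimed_heavy_weight<Op>::value, FallbackPolicy, TunedPolicy>;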

Pros

  • Common use cases will likely experience performance improvements
  • We'll be able to recognize heavy-weight workloads, and, say, remove #pragma unroll from CUB kernels
  • We'll be able to go beyond tuning, and select completely different (work-efficient) implementations of parallel algorithms for this case
  • Annotation can propagate through the stack: a Thrust user can annotate a function and the annotation would travel all the way down to CUB.

Cons

  • Users providing heavy-weight operations will likely experience perf regressions and will have to manually annotate their code

I lean towards the opt-out scheme.

Before closing this issue, we should allow some time to gather more opinions.
After that, the issue can be closed with a design for a function wrapper, along with an example of the opt-in / opt-out (depending on what we choose) behavior in one of the CUB algorithms.

Describe alternatives you've considered

No response

Additional context

No response

@gevtushenko gevtushenko added the feature request New feature or request. label Jan 28, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL Jan 28, 2025
@gevtushenko gevtushenko added thrust For all items related to Thrust. cub For all items related to CUB labels Jan 28, 2025
@gevtushenko
Collaborator Author

@elstehle, @bernhardmgruber, @shwina, @jrhemstad, @oleksandr-pavlyk, @griwes I'd be interested to hear your opinions, especially if you don't think the opt-out scheme is a good idea.

@gevtushenko
Collaborator Author

Related issue: #754

@ahendriksen
Contributor

The new thrust::transform does not rely on unrolling for performance and can auto-tune based on occupancy. It is not a solution that can easily be applied to all other algorithms, but I wanted you to be aware of it.
