
Add simple kernel for deterministic reduction #2234

Open · wants to merge 12 commits into base: main from simple_kernel_multiple_tiles_pr
Conversation

SAtacker
Description

RFA - #1558 (comment)

This PR includes the RFA Kernel in CUB which tiles over the input and reduces the input sum deterministically.

Checklist

  • I am familiar with the Contributing Guidelines
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@SAtacker SAtacker requested review from a team as code owners August 13, 2024 20:43
@SAtacker SAtacker requested review from elstehle and gonidelis August 13, 2024 20:43
copy-pr-bot bot commented Aug 13, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@miscco (Collaborator) left a comment:

Some early comments. Please make sure that we reuse as much internal facilities as possible

cub/benchmarks/bench/reduce/deterministic.cu (outdated; resolved)
cub/benchmarks/bench/reduce/deterministic.cu (outdated; resolved)
cub/cub/detail/rfa.cuh (outdated; resolved)
Comment on lines 50 to 51
#ifndef __CUDACC__
# define __host__
# define __device__
# define __forceinline__
# include <array>
using std::array;
using std::max;
using std::min;
#else
Collaborator:
That should not be necessary.

We already define fallbacks for the macros, see _CCCL_HOST, _CCCL_DEVICE, _CCCL_FORCEINLINE

Also libcu++ works without nvcc, so all this should just include cuda/std/*

Author (@SAtacker):

Also libcu++ works without nvcc, so all this should just include cuda/std/*

Do you mean I should get rid of the std::* usages? If that's the case, I don't think we can, because some of this code is reusable on the host. Can you please clarify?

cub/cub/detail/rfa.cuh (outdated; resolved)
cub/cub/detail/rfa.cuh (outdated; resolved)
}
};

/// Class to hold a reproducible summation of the numbers passed to it
Collaborator:

We are moving towards rst style documentation, so please use //! instead of ///. Applies throughout

@bernhardmgruber (Contributor) commented Aug 14, 2024:

Is this a general rule? I wondered a few times already whether we should write down some of those guidelines.

Author (@SAtacker):

I am guessing I should only do that where we actually need docs for a function; the rest of the comments can stay.

cub/cub/detail/rfa.cuh (outdated; resolved)
Comment on lines +357 to +354
const auto& c = *this;
return const_cast<ftype&>(c.primary(i));
Collaborator (@miscco):

I do not understand the purpose of the c temporary and the const cast. Could you elaborate why it is needed and not just:

Suggested change:
- const auto& c = *this;
- return const_cast<ftype&>(c.primary(i));
+ return primary(i);

Author (@SAtacker):

Won't that be a recursive call to itself?

Member:

@miscco I think here is your answer

cub/cub/detail/rfa.cuh (outdated; resolved)
/// Return primary vector value const ref
__host__ __device__ __forceinline__ const ftype& primary(int i) const
{
if constexpr (FOLD <= MAX_JUMP)
Contributor:

Question: is the use of C++17 blessed? IIUC, we can require C++17 for entirely new features in CUB/Thrust, but everything that changes existing infrastructure has to remain compatible with C++11 until the big drop.

Collaborator:

For new API, it's a compromise. If it takes little effort, we support C++11. If it'll take us a month to backport a new API to C++11, we can avoid doing that.

@SAtacker (Author) commented Aug 16, 2024:

I have changed enough of the code that I can no longer compare binaries to determine whether this is taking a toll on performance, although I am guessing it should not :-)

Contributor:

This is not a question of performance, it's rather about your tolerance of pain :) (and size of user base that can immediately use this feature).

@gevtushenko gevtushenko marked this pull request as draft August 15, 2024 17:24
@SAtacker SAtacker force-pushed the simple_kernel_multiple_tiles_pr branch 5 times, most recently from 1ea4089 to 8a01d1e Compare August 16, 2024 23:37
@SAtacker (Author) commented Sep 1, 2024:

@gevtushenko Hello! Just a gentle ping about this PR. Thanks!

@SAtacker SAtacker force-pushed the simple_kernel_multiple_tiles_pr branch 2 times, most recently from eac7525 to 62d291a Compare September 10, 2024 16:22
@SAtacker SAtacker marked this pull request as ready for review September 10, 2024 16:32
InputIteratorT d_in,
AccumT* d_out,
OffsetT num_items,

Member:

Suggested change

using float_type_list = c2h::type_list<float, double>;

template <typename InputIteratorT, typename OutputIteratorT, typename NumItemsT>
CUB_RUNTIME_FUNCTION static cudaError_t DeterministicSum(
Member:

This might sound naive, but I am a bit confused. Why are we declaring DeterministicSum in a test file? Don't we want this to be an interface under a header? If that's how users are supposed to use it, the call to cub::detail::DeterministicDispatchReduce::Dispatch still gives me the creeps.

Author (@SAtacker):

@gevtushenko and others have not yet finalized the API and hence we have it here.

Author (@SAtacker):

The goal was to have a minimal diff and no changes to the existing codebase for now.

}

template <int NOMINAL_BLOCK_THREADS_4B, int NOMINAL_ITEMS_PER_THREAD_4B>
struct AgentReducePolicy
@gonidelis (Member) commented Sep 12, 2024:

Same question: why is the Agent*Policy in a test file? Don't we want it available as an interface? It seems you are also using it in the benchmark file, which is one more reason to separate it out?

Author (@SAtacker):

This is to test the determinism of the algorithm. By using different block sizes and items per thread, we simulate different orders for the reduction (summation); if the algorithm were not deterministic, the test would fail.

@gonidelis (Member):

Please try rebasing onto main.

Shreyas Atre and others added 7 commits September 25, 2024 08:51
Signed-off-by: Shreyas Atre <[email protected]>
- Remove unnecessary functions
- Fix a minor comment mentioning FOLD

Signed-off-by: Shreyas Atre <[email protected]>
Signed-off-by: Shreyas Atre <[email protected]>
@SAtacker SAtacker force-pushed the simple_kernel_multiple_tiles_pr branch from 62d291a to 327c2a2 Compare September 25, 2024 03:21
@wence- (Contributor) commented Nov 21, 2024:

@jrhemstad pointed this implementation out to me. I think it would be useful, as part of this implementation, to document the error bounds the new algorithm has.

Specifically, not only is it deterministic, but it also has much better error than the naive recursive summation (std::accumulate), or even pairwise summation (tree-based divide-and-conquer, e.g. allowed by std::reduce).

My understanding is that this PR implements the algorithm of Ahrens, Demmel, and Nguyen (2020); I think the choice of the number of bins is free (but the default uses the recommended number).

Summary of my best efforts to survey "good" approaches. Suppose we are computing the sum of $n$ numbers $(x_1, \dots, x_n)$. Denote the exact sum $T := \sum_i x_i$, and the approximate as $T + E$ for some error $E$.

The best possible floating point approximation to this sum is the faithfully rounded result $\text{fl}(T)$. This result is obtained by the FastAccSum algorithm of Rump (2009) and has the error bound

$$ |E| \le \epsilon |T|, $$

where $\epsilon$ is machine epsilon.

The "classic" compensated summation approach of Kahan has

$$ |E| \le \left(2 \epsilon + \mathcal{O}(n \epsilon^2)\right) \sum_i |x_i|. $$

The Ahrens et al. approach has (eq. 6.1 from the linked paper), using the values they suggest as free parameters for double precision

$$ |E| \le 2^{-80} n \max_i |x_i| + 7 \epsilon |T|. $$

So this new implementation is worse than Kahan on paper, but in (perhaps typical?) most cases not by much.

To compare, since the non-deterministic DeviceSum does (approximately) divide-and-conquer, the error bound is, I think, that of pairwise summation

$$ |E| \le \frac{\epsilon \log_2 n}{1 - \epsilon \log_2 n} \sum_i |x_i| = \left(\epsilon \log_2 n + \mathcal{O}\left((\epsilon \log_2 n)^2\right)\right) \sum_i |x_i|, $$

see, e.g., eq. (4.6) of Higham, Accuracy and stability of numerical algorithms (2002).

Aside, the bound you get from recursive summation (what you get if you write std::accumulate) is

$$ |E| \le (n - 1) \epsilon \sum_i |x_i| + \mathcal{O}(\epsilon^2), $$

again, see Higham's book, this time eq. (4.4).

@SAtacker (Author):

Substituting $\epsilon$ in your equations for double precision, i.e. $\epsilon \approx 10^{-16}$, so that it's easier for me to read:

The "classic" compensated summation approach of Kahan has ...

For IEEE standard double-precision floating point, Kahan error becomes:

$$10^{-16} * T + O(n*10^{-32})*T$$

Ahrens et al. approach has (eq. 6.1 from the linked paper) ...

In Table 2 Suggested Parameter Settings

$$2^{-50} * T + 2^{-80} * T'$$

or Approx.

$$10^{-15} * T + 10^{-24} * T'$$

Where $T'$ is $n*max(x_j)$ and $T$ is the exact sum

So this new implementation is worse than Kahan on paper, but in (perhaps typical?) most cases not by much.

So for double precision it is perhaps only slightly worse than Kahan, not much worse?

Also for pairwise summation, approx.

$$10^{-16}\log_2 n \cdot T + O\left(10^{-32}\log_2^2 n\right) T$$

Certainly not better than pairwise summation perhaps.

@wence- (Contributor) commented Nov 21, 2024:

Certainly not better than pairwise summation perhaps.

This is not quite right, note that the error bounds in Kahan and similar use $\sum_i |x_i| \ne |T|$ (unless all values have the same sign).

But we can consider some examples. Suppose we're summing $10^9$ elements that are all approximately the same size, and all the same sign. Then:

Rump:

$$ |E| = 10^{-16} |T| \lessapprox 10^{-16} \cdot 10^9 \max_i |x_i| $$

Kahan:

$$ |E| \le (2 \cdot 10^{-16} + 10^9 \cdot 10^{-32}) \sum_i |x_i| \lessapprox 2 \cdot 10^{-16} \cdot 10^{9} \max_i |x_i| $$

Ahrens:

$$ |E| \le 7 \cdot 10^{-16} |T| + 10^{-24} \cdot 10^9 \max_i |x_i| \lessapprox 7 \cdot 10^{-16} \cdot 10^9 \max_i |x_i| $$

Pairwise:

$$ |E| \le 10^{-16} \log_2 10^9 \sum_i |x_i| \lessapprox 3 \cdot 10^{-15} \cdot 10^9 \max_i |x_i| $$

So in this "easy" scenario, Ahrens is about 3.5 times worse than Kahan, and 4 times better than pairwise.

Labels: none yet
Projects: Status: In Review

6 participants