
Add simple kernel for deterministic reduction #2234

Open · wants to merge 12 commits into base: main from simple_kernel_multiple_tiles_pr
Conversation

SAtacker
Description

RFA - #1558 (comment)

This PR includes the RFA Kernel in CUB which tiles over the input and reduces the input sum deterministically.

Checklist

  • I am familiar with the Contributing Guidelines
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@SAtacker SAtacker requested review from a team as code owners August 13, 2024 20:43
@SAtacker SAtacker requested review from elstehle and gonidelis August 13, 2024 20:43
copy-pr-bot bot commented Aug 13, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@miscco (Collaborator) left a comment:

Some early comments. Please make sure that we reuse as much internal facilities as possible

cub/benchmarks/bench/reduce/deterministic.cu (outdated; resolved)
cub/benchmarks/bench/reduce/deterministic.cu (outdated; resolved)
cub/cub/detail/rfa.cuh (outdated; resolved)
Comment on lines 50 to 51
#ifndef __CUDACC__
# define __host__
# define __device__
# define __forceinline__
# include <array>
using std::array;
using std::max;
using std::min;
#else
Collaborator:
That should not be necessary.

We already define fallbacks for the macros, see _CCCL_HOST, _CCCL_DEVICE, _CCCL_FORCEINLINE

Also libcu++ works without nvcc, so all this should just include cuda/std/*

Author (@SAtacker):

Also libcu++ works without nvcc, so all this should just include cuda/std/*

Do you mean I should get rid of the std::* usages? If that's the case, I don't think we can, because some of this code is reusable on the host. Can you please clarify?

cub/cub/detail/rfa.cuh (outdated; resolved)
cub/cub/detail/rfa.cuh (outdated; resolved)
}
};

/// Class to hold a reproducible summation of the numbers passed to it
Collaborator:

We are moving towards rst style documentation, so please use //! instead of ///. Applies throughout

@bernhardmgruber (Contributor) commented Aug 14, 2024:

Is this a general rule? I wondered a few times already whether we should write down some of those guidelines.

Author (@SAtacker):

I am guessing I should only do that where we actually need docs for a function; the rest of the comments can stay.

cub/cub/detail/rfa.cuh (outdated; resolved)
Comment on lines +357 to +354
const auto& c = *this;
return const_cast<ftype&>(c.primary(i));
Collaborator (@miscco):

I do not understand the purpose of the c temporary and the const cast. Could you elaborate why it is needed and not just:

Suggested change:
- const auto& c = *this;
- return const_cast<ftype&>(c.primary(i));
+ return primary(i);

Author (@SAtacker):

Won't that be a recursive call to itself?

Member:

@miscco I think here is your answer

cub/cub/detail/rfa.cuh (outdated; resolved)
/// Return primary vector value const ref
__host__ __device__ __forceinline__ const ftype& primary(int i) const
{
if constexpr (FOLD <= MAX_JUMP)
Contributor:

Question: is the use of C++17 blessed? IIUC, we can require C++17 for entirely new features in CUB/Thrust, but everything that changes existing infrastructure has to remain compatible with C++11 until the big drop.

Collaborator:

For new API, it's a compromise. If it takes little effort, we support C++11. If it'll take us a month to backport a new API to C++11, we can avoid doing that.

@SAtacker (Author) commented Aug 16, 2024:

I have changed enough of the code that I can no longer compare binaries to determine whether this is taking a toll on performance, although I am guessing it should not :-)

Contributor:

This is not a question of performance, it's rather about your tolerance of pain :) (and size of user base that can immediately use this feature).

@gevtushenko gevtushenko marked this pull request as draft August 15, 2024 17:24
@SAtacker SAtacker force-pushed the simple_kernel_multiple_tiles_pr branch 5 times, most recently from 1ea4089 to 8a01d1e Compare August 16, 2024 23:37
@SAtacker (Author) commented Sep 1, 2024:

@gevtushenko Hello! Just a gentle ping about this PR. Thanks!

@SAtacker SAtacker force-pushed the simple_kernel_multiple_tiles_pr branch 2 times, most recently from eac7525 to 62d291a Compare September 10, 2024 16:22
@SAtacker SAtacker marked this pull request as ready for review September 10, 2024 16:32
InputIteratorT d_in,
AccumT* d_out,
OffsetT num_items,

Member:

Suggested change

using float_type_list = c2h::type_list<float, double>;

template <typename InputIteratorT, typename OutputIteratorT, typename NumItemsT>
CUB_RUNTIME_FUNCTION static cudaError_t DeterministicSum(
Member:

This might sound naive, but I am a bit confused. Why are we declaring DeterministicSum in a test file? Don't we want this to be an interface under a header? If that's how users are supposed to use it, the call to cub::detail::DeterministicDispatchReduce::Dispatch still gives me the creeps.

Author (@SAtacker):

@gevtushenko and others have not yet finalized the API and hence we have it here.

Author (@SAtacker):

The goal was to have a minimal diff and no changes to the existing codebase for now.

}

template <int NOMINAL_BLOCK_THREADS_4B, int NOMINAL_ITEMS_PER_THREAD_4B>
struct AgentReducePolicy
@gonidelis (Member) commented Sep 12, 2024:

Same question: why is the Agent*Policy in a test file? Don't we want it available as an interface? It seems you are also using it in the benchmark file, which is one more reason to separate it out?

Author (@SAtacker):

This is to test the determinism of the algorithm. By using different block sizes and items per thread, we simulate different orders for the reduction (summation); if the algorithm were not deterministic, the test would fail.

@gonidelis (Member):

Please try rebasing onto main.

Shreyas Atre and others added 7 commits September 25, 2024 08:51
Signed-off-by: Shreyas Atre <[email protected]>
- Remove unnecessary functions
- Fix a minor comment mentioning FOLD

Signed-off-by: Shreyas Atre <[email protected]>
Signed-off-by: Shreyas Atre <[email protected]>
@SAtacker SAtacker force-pushed the simple_kernel_multiple_tiles_pr branch from 62d291a to 327c2a2 Compare September 25, 2024 03:21
@wence- (Contributor) commented Nov 21, 2024:

@jrhemstad pointed this implementation out to me. I think it would be useful, as part of this implementation, to document the error bounds the new algorithm has.

Specifically, not only is it deterministic, but it also has much better error than the naive recursive summation (std::accumulate), or even pairwise summation (tree-based divide-and-conquer, e.g. allowed by std::reduce).

My understanding is that this PR implements the algorithm of Ahrens, Demmel, and Nguyen (2020); I think the choice of the number of bins is free (but the default uses the recommended number).

Summary of my best efforts to survey "good" approaches. Suppose we are computing the sum of $n$ numbers $(x_1, \dots, x_n)$. Denote the exact sum $T := \sum_i x_i$, and the approximate as $T + E$ for some error $E$.

The best possible floating point approximation to this sum is the faithfully rounded result $\text{fl}(T)$. This result is obtained by the FastAccSum algorithm of Rump (2009) and has the error bound

$$ |E| \le \epsilon |T|, $$

where $\epsilon$ is machine epsilon.

The "classic" compensated summation approach of Kahan has

$$ |E| \le \left(2 \epsilon + \mathcal{O}(n \epsilon^2)\right) \sum_i |x_i|. $$

The Ahrens et al. approach has (eq. 6.1 from the linked paper), using the values they suggest as free parameters for double precision

$$ |E| \le 2^{-80} n \max_i |x_i| + 7 \epsilon |T|. $$

So this new implementation is worse than Kahan on paper, but in (perhaps typical?) most cases not by much.

To compare, since the non-deterministic DeviceSum does (approximately) divide-and-conquer, the error bound is, I think, that of pairwise summation

$$ |E| \le \frac{\epsilon \log_2 n}{1 - \epsilon \log_2 n} \sum_i |x_i| = \left(\epsilon \log_2 n + \mathcal{O}\left((\epsilon \log_2 n)^2\right)\right) \sum_i |x_i|, $$

see, e.g., eq. (4.6) of Higham, Accuracy and stability of numerical algorithms (2002).

Aside, the bound you get from recursive summation (what you get if you write std::accumulate) is

$$ |E| \le (n - 1) \epsilon \sum_i |x_i| + \mathcal{O}(\epsilon^2), $$

again, see Higham's book, this time eq. (4.4).

@SAtacker (Author):

Substituting $\epsilon$ in your equations for double precision, i.e. $\epsilon \approx 10^{-16}$, so that it's easier for me to read:

The "classic" compensated summation approach of Kahan has ...

For IEEE standard double-precision floating point, Kahan error becomes:

$$10^{-16} * T + O(n*10^{-32})*T$$

Ahrens et al. approach has (eq. 6.1 from the linked paper) ...

In Table 2 Suggested Parameter Settings

$$2^{-50} * T + 2^{-80} * T'$$

or Approx.

$$10^{-15} * T + 10^{-24} * T'$$

Where $T'$ is $n*max(x_j)$ and $T$ is the exact sum

So this new implementation is worse than Kahan on paper, but in (perhaps typical?) most cases not by much.

So for double precision it is perhaps only slightly worse than Kahan, not much worse?

Also for pairwise summation, approx.

$$10^{-16}\log_2 n \cdot T + O\left(10^{-32}\log_2^2 n\right) T$$

Certainly not better than pairwise summation perhaps.

@wence- (Contributor) commented Nov 21, 2024:

Certainly not better than pairwise summation perhaps.

This is not quite right, note that the error bounds in Kahan and similar use $\sum_i |x_i| \ne |T|$ (unless all values have the same sign).

But we can consider some examples. Suppose we're summing $10^9$ elements that are all approximately the same size, and all the same sign. Then:

Rump:

$$ |E| = 10^{-16} |T| \lessapprox 10^{-16} \cdot 10^9 \max_i |x_i| $$

Kahan:

$$ |E| \le (2 \cdot 10^{-16} + 10^9 \cdot 10^{-32}) \sum_i |x_i| \lessapprox 2 \cdot 10^{-16} \cdot 10^{9} \max_i |x_i| $$

Ahrens:

$$ |E| \le 7 \cdot 10^{-16} |T| + 10^{-24} \cdot 10^9 \max_i |x_i| \lessapprox 7 \cdot 10^{-16} \cdot 10^9 \max_i |x_i| $$

Pairwise:

$$ |E| \le 10^{-16} \log_2 10^9 \sum_i |x_i| \lessapprox 3 \cdot 10^{-15} \cdot 10^9 \max_i |x_i| $$

So in this "easy" scenario, Ahrens is about 3.5 times worse than Kahan, and 4 times better than pairwise.

Labels: none yet
Projects: Status: In Review

6 participants