Add simple kernel for deterministic reduction #2234
base: main
Conversation
Some early comments. Please make sure that we reuse as many internal facilities as possible.
cub/cub/detail/rfa.cuh
Outdated
#ifndef __CUDACC__
#  define __host__
#  define __device__
#  define __forceinline__
#  include <array>
using std::array;
using std::max;
using std::min;
#else
That should not be necessary. We already define fallbacks for the macros, see _CCCL_HOST, _CCCL_DEVICE, and _CCCL_FORCEINLINE. Also, libcu++ works without nvcc, so all of this should just include cuda/std/*.
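For illustration, a minimal hedged sketch of that direction (the accumulate_sketch function and its body are placeholders, not the PR's actual rfa.cuh code; only the _CCCL_* macros and the cuda/std header are the facilities being recommended):

```cpp
// Sketch: rely on CCCL's fallback macros and the heterogeneous <cuda/std/...>
// headers instead of re-defining __host__/__device__ and falling back to <array>.
#include <cuda/std/array> // usable from both nvcc and host-only compilers; pulls in the _CCCL_* config macros

template <class T, int N>
_CCCL_HOST _CCCL_DEVICE _CCCL_FORCEINLINE T accumulate_sketch(const cuda::std::array<T, N>& a)
{
  T sum{};
  for (int i = 0; i < N; ++i)
  {
    sum += a[i];
  }
  return sum;
}
```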
> Also libcu++ works without nvcc, so all this should just include cuda/std/*

Do you mean I should get rid of the std::* usages? If that's the case, I don't think we can do that, because some of the code is reusable on host. Can you please clarify?
cub/cub/detail/rfa.cuh
Outdated
}
};

/// Class to hold a reproducible summation of the numbers passed to it
We are moving towards rst-style documentation, so please use //! instead of ///. Applies throughout.
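For illustration, a hedged sketch of the requested style applied to the doc line quoted above (the class name and template parameters are placeholders, not the PR's actual declaration):

```cpp
//! Class to hold a reproducible summation of the numbers passed to it
//! (rst-style //! doc comment, as requested, instead of Doxygen-style ///)
template <class Ftype, int Fold>
class reproducible_accumulator_sketch;
```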
Is this a general rule? I wondered a few times already whether we should write down some of those guidelines.
I am guessing I only do that where we actually need the docs for the function, and the rest of the comments can stay.
const auto& c = *this;
return const_cast<ftype&>(c.primary(i));
I do not understand the purpose of the c temporary and the const_cast. Could you elaborate why it is needed and not just:

- const auto& c = *this;
- return const_cast<ftype&>(c.primary(i));
+ return primary(i);
Won't that be a recursive call to itself?
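For context, a hedged sketch of the usual const/non-const overload-delegation pattern this snippet appears to follow (member names and types are placeholders modeled on the quoted code, not the PR's exact definitions). Inside the non-const overload, a bare `return primary(i);` resolves to the non-const overload itself and recurses; viewing *this through a const reference forces the call to the const overload, and the const_cast on the result is safe because the object is known to be non-const here:

```cpp
struct accumulator_sketch
{
  using ftype = float;
  ftype data[3];

  // Const overload holds the real indexing logic.
  __host__ __device__ const ftype& primary(int i) const
  {
    return data[i];
  }

  // Non-const overload delegates to the const one instead of duplicating it.
  __host__ __device__ ftype& primary(int i)
  {
    const auto& c = *this;                   // view *this as const
    return const_cast<ftype&>(c.primary(i)); // reuse the const overload, drop constness
  }
};
```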
cub/cub/detail/rfa.cuh
Outdated
/// Return primary vector value const ref
__host__ __device__ __forceinline__ const ftype& primary(int i) const
{
  if constexpr (FOLD <= MAX_JUMP)
Question: is the use of C++17 blessed? IIUC, we can require C++17 for entirely new features in CUB/Thrust, but everything that changes existing infrastructure has to remain compatible with C++11 until the big drop.
For new API, it's a compromise. If it takes little effort, we support C++11. If it'll take us a month to backport a new API to C++11, we can avoid doing that.
I have changed a significant amount of code, so I'm not able to compare the binaries and determine whether this is taking a toll on performance. Although I am guessing it should not :-)
This is not a question of performance; it's rather about your tolerance for pain :) (and the size of the user base that can immediately use this feature).
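For reference, a hedged sketch of the trade-off being discussed; FOLD, MAX_JUMP, and the bins member are placeholders modeled on the quoted snippet, not the PR's exact code. The quoted code branches at compile time with C++17 `if constexpr`; a C++11-compatible variant would typically use tag dispatch on an integral_constant (or partial specialization) instead:

```cpp
#include <cuda/std/type_traits>

template <class ftype, int FOLD, int MAX_JUMP>
struct primary_sketch
{
  ftype bins[FOLD > 0 ? FOLD : 1];

  // C++11-friendly: select the branch via overload resolution on true_type/false_type.
  __host__ __device__ const ftype& get(int i, cuda::std::true_type /* FOLD <= MAX_JUMP */) const
  {
    return bins[i];
  }

  __host__ __device__ const ftype& get(int i, cuda::std::false_type /* FOLD > MAX_JUMP */) const
  {
    return bins[i % FOLD]; // illustrative alternative code path
  }

  __host__ __device__ const ftype& primary(int i) const
  {
    return get(i, cuda::std::integral_constant<bool, (FOLD <= MAX_JUMP)>{});
  }
};
```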
@gevtushenko Hello! Just a gentle ping about this PR. Thanks!
InputIteratorT d_in,
AccumT* d_out,
OffsetT num_items,
using float_type_list = c2h::type_list<float, double>;

template <typename InputIteratorT, typename OutputIteratorT, typename NumItemsT>
CUB_RUNTIME_FUNCTION static cudaError_t DeterministicSum(
This might sound naive, but I am a bit confused. Why are we declaring DeterministicSum in a test file? Don't we want this to be an interface under a header? If that's how users are supposed to use it, the call to cub::detail::DeterministicDispatchReduce::Dispatch still gives me the creeps.
@gevtushenko and others have not yet finalized the API and hence we have it here.
The goal was to have a minimal diff and no changes to the existing codebase for now.
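For context, the expectation voiced above is that a deterministic sum would eventually be exposed the same way the existing public cub::DeviceReduce::Sum is, rather than via a detail-level dispatch call in a test file. A hedged sketch of that existing two-phase usage pattern (this uses only the current public API; the final shape of DeterministicSum is explicitly not settled in this PR):

```cpp
#include <cub/device/device_reduce.cuh>
#include <cuda_runtime.h>

cudaError_t sum_example(const float* d_in, float* d_out, int num_items, cudaStream_t stream = 0)
{
  void* d_temp_storage      = nullptr;
  size_t temp_storage_bytes = 0;

  // First call: query the required temporary storage size.
  cudaError_t error = cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);
  if (error != cudaSuccess)
  {
    return error;
  }

  error = cudaMalloc(&d_temp_storage, temp_storage_bytes);
  if (error != cudaSuccess)
  {
    return error;
  }

  // Second call: run the (non-deterministic) reduction.
  error = cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);
  cudaFree(d_temp_storage);
  return error;
}
```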
}

template <int NOMINAL_BLOCK_THREADS_4B, int NOMINAL_ITEMS_PER_THREAD_4B>
struct AgentReducePolicy
Same question: why is the Agent*Policy in a test file? Don't we want it to be available as an interface? It seems you are also using it in the benchmark file, which is one more reason to separate it out?
This is to test the determinism of the algorithm. By using different block sizes and threads per block, we try to simulate different orders for the reduction (summation); if it were not deterministic, the test would fail.
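For illustration, a small hedged host-side sketch of what varying the tuning policy guards against (this is not the PR's test code): summing the same floats with different groupings, analogous to different block sizes and items per thread, generally changes the rounded result, so the determinism test demands bitwise-equal outputs from every configuration:

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

// Sum in chunks of `group` elements, then sum the partials: a crude stand-in
// for per-block partial sums followed by a final reduction.
static float grouped_sum(const std::vector<float>& x, size_t group)
{
  std::vector<float> partials;
  for (size_t i = 0; i < x.size(); i += group)
  {
    float p = 0.0f;
    for (size_t j = i; j < x.size() && j < i + group; ++j)
    {
      p += x[j];
    }
    partials.push_back(p);
  }
  float s = 0.0f;
  for (float p : partials)
  {
    s += p;
  }
  return s;
}

int main()
{
  std::vector<float> x(1 << 20);
  for (size_t i = 0; i < x.size(); ++i)
  {
    x[i] = 1.0f / static_cast<float>(i + 1); // values of varying magnitude
  }

  const float a = grouped_sum(x, 128);
  const float b = grouped_sum(x, 256);

  // A non-reproducible sum typically fails this bitwise comparison; the
  // deterministic RFA-based reduction is expected to pass it for any grouping.
  std::printf("bitwise equal: %s (%.9g vs %.9g)\n",
              std::memcmp(&a, &b, sizeof(float)) == 0 ? "yes" : "no", a, b);
  return 0;
}
```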
Please try rebasing to main.
@jrhemstad pointed this implementation out to me. I think it would be useful, as part of this implementation, to document the error bounds the new algorithm has. Specifically, not only is it deterministic, but it also has much better error than naive recursive summation (whose worst-case error grows linearly with the number of inputs).

My understanding is that this PR implements the algorithm of Ahrens, Demmel, and Nguyen (2020). I think the choice of the number of bins is free (but the default uses the recommended number).

Summary of my best efforts to survey "good" approaches. Suppose we are computing the sum $s = \sum_{i=1}^{n} x_i$ of $n$ floating-point numbers, with unit roundoff $u$. The best possible floating-point approximation to this sum is the faithfully rounded result $\hat{s}$, one of the two floating-point numbers adjacent to $s$, for which $|\hat{s} - s| < 2u\,|s|$.

The "classic" compensated summation approach of Kahan has

$$|\hat{s} - s| \le \bigl(2u + O(nu^2)\bigr) \sum_{i=1}^{n} |x_i|.$$

The Ahrens et al. approach has a bound of roughly

$$|\hat{s} - s| \le n\,2^{-W(K-1)} \max_i |x_i| + O(u) \sum_{i=1}^{n} |x_i|$$

(eq. 6.1 from the linked paper); using the values they suggest as free parameters for double precision ($K = 3$ bins of width $W = 40$ bits), the first term is about $n\,2^{-80} \max_i |x_i|$. So this new implementation is worse than Kahan on paper, but in (perhaps typical?) most cases not by much.

To compare, the non-deterministic device reduction is essentially a pairwise/tree summation, with error bound

$$|\hat{s} - s| \le \lceil \log_2 n \rceil\, u \sum_{i=1}^{n} |x_i| + O(u^2);$$

see, e.g., eq. (4.6) of Higham, Accuracy and Stability of Numerical Algorithms (2002). Aside, the bound you get from recursive summation (what you get if you write a plain serial accumulation loop) is

$$|\hat{s} - s| \le (n-1)\, u \sum_{i=1}^{n} |x_i| + O(u^2);$$

again, see Higham's book, this time eq. (4.4).
Substituting $u = 2^{-53}$ for IEEE standard double-precision floating point, the Kahan error becomes approximately

$$2 \cdot 2^{-53} \sum_i |x_i| \approx 2.2 \times 10^{-16} \sum_i |x_i|.$$

With the parameters from Table 2 (Suggested Parameter Settings) of the paper, the Ahrens et al. leading term is approximately

$$n\,2^{-80} \max_i |x_i| \approx n \cdot 8.3 \times 10^{-25} \max_i |x_i|,$$

where $\max_i |x_i|$ is the largest input magnitude. It is perhaps slightly worse, but not much worse (for double precision), than Kahan? Also, pairwise summation is approximately $\lceil \log_2 n \rceil \cdot 2^{-53} \sum_i |x_i|$. Certainly not better than pairwise summation, perhaps.
This is not quite right; note that the error bounds for Kahan and similar methods are stated in terms of $\sum_i |x_i|$, so how they compare to the Ahrens et al. bound depends on the data. But we can consider some examples. Suppose we're summing values for which each bound can be evaluated directly and we compare the faithfully rounded (Rump-style) result, Kahan, Ahrens, and pairwise summation. In this "easy" scenario, Ahrens is about 3.5 times worse than Kahan, and 4 times better than pairwise.
Description
RFA - #1558 (comment)
This PR adds the RFA (reproducible floating-point accumulator) kernel to CUB, which tiles over the input and computes the input sum deterministically.
Checklist