Change streaming algorithms to use operator+= from using operator+ #4428
Conversation
1. Changed the streaming algorithm that supports `num_segments` greater than `INT_MAX` in `segmented_reduce` from using `operator+` to advance iterators on the host to using `operator+=`, saving constructor/copy calls.
2. Introduced `void advance_iterators_inplace_if_supported(IteratorT &iter, OffsetT diff)`, which uses `operator+=`, alongside the existing `IteratorT advance_iterators_if_supported(IteratorT iter, OffsetT diff)`.
3. Since these are called from dispatcher functions annotated as `CUB_RUNTIME_FUNCTION`, changed their annotations from `_CCCL_HOST_DEVICE` to `CUB_RUNTIME_FUNCTION`. This broke the `test_nvrtc` test, because `dispatch_common.cuh`, the header where these functions were defined, is included from `agent_select_if.cuh`, which must be compiled by NVRTC.
4. Hence, moved the definitions of the iterator-advancing functions out of `dispatch_common.cuh` into a new header, `dispatch_advance_iterators.cuh`, which is included from `dispatch_reduce.cuh` and `dispatch_radix_sort.cuh`.
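For orientation, here is a minimal sketch of the shape of these two helpers. The detection trait `has_add_assign_operator` and the exact bodies are assumptions for illustration, not the actual CUB implementation:

```cpp
#include <cub/config.cuh>
#include <cuda/std/type_traits>
#include <cuda/std/utility>

// Hypothetical detection trait (name illustrative): true if `it += offset`
// is well-formed for an lvalue of It and a value of Off.
template <typename It, typename Off, typename = void>
struct has_add_assign_operator : ::cuda::std::false_type
{};

template <typename It, typename Off>
struct has_add_assign_operator<
  It,
  Off,
  ::cuda::std::void_t<decltype(::cuda::std::declval<It&>() += ::cuda::std::declval<Off>())>>
    : ::cuda::std::true_type
{};

// Copy-returning variant: one copy on entry and one on return.
template <typename IteratorT, typename OffsetT>
CUB_RUNTIME_FUNCTION IteratorT advance_iterators_if_supported(IteratorT iter, [[maybe_unused]] OffsetT offset)
{
  if constexpr (has_add_assign_operator<IteratorT, OffsetT>::value)
  {
    iter += offset;
  }
  return iter;
}

// In-place variant: mutates the caller's iterator, avoiding the extra copies.
template <typename IteratorT, typename OffsetT>
CUB_RUNTIME_FUNCTION void advance_iterators_inplace_if_supported(IteratorT& iter, [[maybe_unused]] OffsetT offset)
{
  if constexpr (has_add_assign_operator<IteratorT, OffsetT>::value)
  {
    iter += offset;
  }
}
```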
🟨 CI finished in 3h 44m: Pass: 95%/103 | Total: 2d 22h | Avg: 41m 02s | Max: 1h 28m | Hits: 75%/134359
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | stdpar |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |
Modifications in project or dependencies?
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | stdpar |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 103)
| # | Runner |
|---|---|
| 72 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-arm64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-amd64-gpu-rtx2080-latest-1 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
Those functions are really dangerous and I would strongly prefer to just use `cuda::std::advance` rather than those.
```cpp
// Helper function that advances a given iterator only if it supports being advanced by the given offset
template <typename IteratorT, typename OffsetT>
CUB_RUNTIME_FUNCTION _CCCL_VISIBILITY_HIDDEN _CCCL_FORCEINLINE IteratorT
advance_iterators_if_supported(IteratorT iter, [[maybe_unused]] OffsetT offset)
```
Why don't we just use `cuda::std::advance`?
For the sake of my own understanding: does the danger come from the current code disregarding iterator tags?
The danger comes from the operation conditionally doing things. We should ensure that we always properly combine the "fast" path with the fallback. As written I see this as quite dangerous, and I also do not understand why we cannot just use `cuda::std::advance`.
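For context, `cuda::std::advance` is exactly such a combination: it dispatches on the iterator category, taking the O(1) `+=` path for random-access iterators and falling back to repeated increments otherwise. A simplified sketch of that dispatch (illustrative only, not the libcu++ source):

```cpp
#include <cuda/std/iterator>
#include <cuda/std/type_traits>

// Simplified sketch of the category dispatch that cuda::std::advance performs
// (negative distances for bidirectional iterators omitted for brevity).
template <typename It, typename Distance>
__host__ __device__ void advance_sketch(It& it, Distance n)
{
  using category = typename cuda::std::iterator_traits<It>::iterator_category;
  if constexpr (cuda::std::is_base_of_v<cuda::std::random_access_iterator_tag, category>)
  {
    it += n; // fast path: O(1) for random-access iterators
  }
  else
  {
    for (; n > 0; --n) // fallback: O(n) repeated increments
    {
      ++it;
    }
  }
}
```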
```cpp
detail::advance_iterators_inplace_if_supported(d_begin_offsets, num_current_segments);
detail::advance_iterators_inplace_if_supported(d_end_offsets, num_current_segments);
```
This should just be
```diff
-detail::advance_iterators_inplace_if_supported(d_begin_offsets, num_current_segments);
-detail::advance_iterators_inplace_if_supported(d_end_offsets, num_current_segments);
+::cuda::std::advance(d_begin_offsets, num_current_segments);
+::cuda::std::advance(d_end_offsets, num_current_segments);
```
BTW, making this change would break the c.parallel build. We would still need:

```cpp
if constexpr (has_add_assign_operator) {
  ::cuda::std::advance(d_begin_offsets, num_current_segments);
}
```
@miscco The conditionals around the increments are there to support compiling c.parallel, whose iterators do not support being advanced on the host. A run-time check is inserted before the increment iteration loop to ensure that code execution never reaches the loop in that case.
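A hedged sketch of that run-time check, reusing the hypothetical `has_add_assign_operator` trait from the sketch above (the function name and early-exit policy are illustrative, not the actual CUB dispatch code):

```cpp
#include <climits>
#include <cuda_runtime.h>

// If the offset iterators cannot be advanced on the host, num_segments values
// above INT_MAX would require streaming over multiple partitions, which is
// impossible -- so reject early and never reach the partition loop.
template <typename BeginOffsetIteratorT, typename OffsetT>
cudaError_t check_streaming_supported(BeginOffsetIteratorT, OffsetT num_segments)
{
  constexpr bool can_advance = has_add_assign_operator<BeginOffsetIteratorT, OffsetT>::value;
  if (!can_advance && num_segments > static_cast<OffsetT>(INT_MAX))
  {
    return cudaErrorInvalidValue;
  }
  return cudaSuccess;
}
```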
The only downside (consequence) of using `cuda::std::advance` …
Why don't we implement `+=` for `indirect_arg_t`?
Because `indirect_arg_t` is defined as:

```cpp
struct indirect_arg_t
{
  void* ptr;

  indirect_arg_t(cccl_iterator_t& it)
      : ptr(it.type == cccl_iterator_kind_t::CCCL_POINTER ? &it.state : it.state)
  {}

  void* operator&() const
  {
    return ptr;
  }
};
```
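In other words (as far as this thread shows), `indirect_arg_t` erases the iterator down to a `void*`, so the host has no type information left to implement `operator+=` against.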
This is because `InvokePass` may be called multiple times by `InvokePasses` due to the algorithmic nature of radix sorting. With this change, `InvokePass` creates local copies of `d_begin_offsets` and `d_end_offsets` and advances those copies in place if necessary.
@elstehle identified the reason for the CI failures: transitioning to advancing the iterators in place in `InvokePass` mutated iterator state that persisted across calls, and `InvokePass` may be called multiple times. The solution is for `InvokePass` to advance local copies of `d_begin_offsets` and `d_end_offsets` instead.
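A minimal sketch of that pattern, assuming the helper from the first sketch (`InvokePass` is reduced to its iterator handling here; this is not the actual `dispatch_radix_sort.cuh` code):

```cpp
// The offset iterators arrive as by-value copies, so the in-place advance
// mutates only this call's copies; repeated calls from InvokePasses always
// start from the caller's original, un-advanced iterators.
template <typename BeginOffsetIteratorT, typename EndOffsetIteratorT, typename OffsetT>
cudaError_t InvokePass(BeginOffsetIteratorT d_begin_offsets, // local copy
                       EndOffsetIteratorT d_end_offsets,     // local copy
                       OffsetT num_current_segments)
{
  advance_iterators_inplace_if_supported(d_begin_offsets, num_current_segments);
  advance_iterators_inplace_if_supported(d_end_offsets, num_current_segments);
  // ... launch this radix sort pass using the advanced copies ...
  return cudaSuccess;
}
```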
🟩 CI finished in 2h 06m: Pass: 100%/103 | Total: 2d 23h | Avg: 41m 38s | Max: 1h 30m | Hits: 76%/140524
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | stdpar |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |
Modifications in project or dependencies?
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | stdpar |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 103)
| # | Runner |
|---|---|
| 72 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-arm64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-amd64-gpu-rtx2080-latest-1 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
🟩 CI finished in 1h 38m: Pass: 100%/103 | Total: 1d 02h | Avg: 15m 13s | Max: 1h 19m | Hits: 96%/140524
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| | Thrust |
| | CUDA Experimental |
| | stdpar |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |
Modifications in project or dependencies?
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | stdpar |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 103)
| # | Runner |
|---|---|
| 72 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-arm64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-amd64-gpu-rtx2080-latest-1 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |