-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract merge sort kernels to NVRTC compilable header #3438
Extract merge sort kernels to NVRTC compilable header #3438
Conversation
…ator, and use ::cuda::std instead of std to avoid nvrtc errors
🟨 CI finished in 1h 54m: Pass: 96%/78 | Total: 2d 02h | Avg: 39m 03s | Max: 1h 13m | Hits: 189%/10972
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 78)
# | Runner |
---|---|
53 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
9 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
🟩 CI finished in 2h 10m: Pass: 100%/78 | Total: 2d 08h | Avg: 43m 17s | Max: 1h 17m | Hits: 195%/12824
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 78)
# | Runner |
---|---|
53 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
9 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already looking great
I need to close my eyes on the things thrust is doing with the iterators, but that is preexisting and not part of this PR
cub/cub/agent/agent_merge_sort.cuh
Outdated
|
||
#include <cuda/std/__cccl/cuda_capabilities.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This header is part of our global config and should not be included separately
#include <cuda/std/__cccl/cuda_capabilities.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I included this because I need it for _CCCL_PDL_GRID_DEPENDENCY_SYNC()
. Which header file should I be including instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
none, all of that is already in the cub config
Everything within cuda/std/__cccl
is supposed to be globally available so every config includes it first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this offline but the issue is that this header has not been included into the cccl config, so I believe that this is fine for now
#include <iterator> | ||
#include <type_traits> | ||
#include <utility> | ||
#include <cuda/std/iterator> // Needed for __gnu_cxx::__normal_iterator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No change requested, but all of this needs to go an be replaced by ::cuda::std::contiguous_iterator
…ct-merge-sort-to-nvrtc-header
🟩 CI finished in 1h 51m: Pass: 100%/78 | Total: 2d 06h | Avg: 42m 03s | Max: 1h 13m | Hits: 241%/12784
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 78)
# | Runner |
---|---|
53 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
9 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
🟩 CI finished in 2h 32m: Pass: 100%/78 | Total: 2d 10h | Avg: 44m 55s | Max: 1h 19m | Hits: 102%/12784
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 78)
# | Runner |
---|---|
53 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
9 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
🟩 CI finished in 2h 01m: Pass: 100%/78 | Total: 2d 08h | Avg: 43m 34s | Max: 1h 11m | Hits: 194%/12784
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 78)
# | Runner |
---|---|
53 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
9 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
…ct-merge-sort-to-nvrtc-header
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a lot of thrust functionality being moved into individual headers. Is there a technical reason for this? Is it because of NVRTC?
template <typename PolicyT, int BLOCK_THREADS_, int ITEMS_PER_THREAD_ = PolicyT::ITEMS_PER_THREAD> | ||
struct policy_wrapper_t : PolicyT | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: Why does this struct need to be moved into a separate file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is due to NVRTC. Specifically, merge_sort.cuh
needs policy_wrapper_t
but I can't include all of util_device.cuh
due to NVRTC errors which I could not resolve easily, so my strategy with this and similar situations was to move things to separate headers. For the most part, I tried to fix the NVRTC errors but when that failed I resorted to this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I played a bit with NVRTC, and getting util_device.cuh
to compile is surprisingly easy (you just #ifdef
out everything from the header except policy_wrapper_t
). So I guess pulling out that part makes a lot of sense!
I also tried getting the merge sort headers to compile, but I got stuck at thrust iterators, as always. This is what I meant today in the team meeting that we need an overhaul here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I wasn't aware I could do that. Would you like me to implement this instead of what I have now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, just pull out policy_wrapper_t
, it's fine!
Yeah see #3438 (comment) |
🟩 CI finished in 2h 38m: Pass: 100%/78 | Total: 2d 09h | Avg: 43m 52s | Max: 1h 13m | Hits: 172%/12772
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 78)
# | Runner |
---|---|
53 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
9 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
…ct-merge-sort-to-nvrtc-header
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice work
🟩 CI finished in 4h 11m: Pass: 100%/78 | Total: 2d 07h | Avg: 42m 40s | Max: 1h 15m | Hits: 164%/12772
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 78)
# | Runner |
---|---|
53 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
9 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
* Move merge_sort kernels to separate file * Add merge_sort nvrtc test * Remove include that contains host code and replace with cuda::std * Remove unneeded headers from merge_sort header * Move LoadIterator to separate header and replace include * Add host device macro to has_nested_type to fix nvrtc issue * Extract make_load_iterator into separate file to avoid nvrtc error * Extract is_thrust_pointer into separate file to avoid nvrtc error * Extract policy_wrapper_t into separate file, forward declare LoadIterator, and use ::cuda::std instead of std to avoid nvrtc errors * Extract unwrap_contiguous_iterator into separate file to avoid nvrtc errors * Add missing include following header reorganization * Add comment explaining why we forward declare make_load_iterator * Add missing iterator include * Add missing thrust config include * Use is_same_v and rearrange include according to formatter * Add missing comment to endif * Use SPDX license instead of longer one * Use nested namespace specifier * Use nested namespace specifiers and _v suffix in other files
Description
Closes #3386
Similar to #2231 and #3334, this PR extracts
DeviceMergeSortBlockSortKernel
,DeviceMergeSortPartitionKernel
, andDeviceMergeSortMergeKernel
into an NVRTC compilable header.Checklist