Add Scan implementation for c.parallel #3462
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Force-pushed 811ba03 to ca6df87
Force-pushed 26582cf to 6534ed7
Force-pushed d66dbeb to 0615c85
🟨 CI finished in 3h 26m: Pass: 98%/90 | Total: 1d 18h | Avg: 28m 38s | Max: 1h 08m | Hits: 331%/10928

| Project | Modified | Project or dependencies modified |
|---|---|---|
| CCCL Infrastructure | | |
| libcu++ | | |
| CUB | +/- | +/- |
| Thrust | | +/- |
| CUDA Experimental | | |
| python | | +/- |
| CCCL C Parallel Library | +/- | +/- |
| Catch2Helper | | +/- |

🏃 Runner counts (total jobs: 90)

| # | Runner |
|---|---|
| 65 | linux-amd64-cpu16 |
| 11 | linux-amd64-gpu-v100-latest-1 |
| 9 | windows-amd64-cpu16 |
| 4 | linux-arm64-cpu16 |
| 1 | linux-amd64-gpu-h100-latest-1-testing |
🟨 CI finished in 4h 25m: Pass: 98%/90 | Total: 1d 18h | Avg: 28m 39s | Max: 1h 08m | Hits: 331%/10928

(project and runner breakdown identical to the previous run)
Force-pushed 0615c85 to 0aa00e7
Review comment on c/parallel/src/scan.cu (outdated):
```cpp
if (accum_t.type == cccl_type_enum::STORAGE)
{
  std::string src = std::format(
    "#include <cub/agent/single_pass_scan_operators.cuh>\n"
    "#include <cub/util_type.cuh>\n"
    "struct __align__({1}) storage_t {{\n"
    "  char data[{0}];\n"
    "}};\n"
    "__device__ size_t status_size = sizeof(typename cub::ScanTileState<{2}>::StatusWord);\n"
    "__device__ size_t uninitialized_size = sizeof(cub::Uninitialized<{2}>);\n",
    accum_t.size,
    accum_t.alignment,
    accum_cpp);
  auto compile_result =
    make_nvrtc_command_list()
      .add_program(nvrtc_translation_unit{src.c_str(), "tile_state_info"})
      .compile_program({ptx_args, num_ptx_args})
      .cleanup_program()
      .finalize_program(num_ptx_lto_args, ptx_lopts);
  auto ptx_code       = compile_result.cubin.get();
  status_size         = scan::find_size_t(ptx_code, "status_size");
  uninitialized_size  = scan::find_size_t(ptx_code, "uninitialized_size");
}
else
{
  std::string src = std::format(
    "#include <cub/agent/single_pass_scan_operators.cuh>\n"
    "#include <cub/util_type.cuh>\n"
    "__device__ size_t status_size = sizeof(typename cub::ScanTileState<{0}>::TxnWord);\n",
    accum_cpp);
  auto compile_result =
    make_nvrtc_command_list()
      .add_program(nvrtc_translation_unit{src.c_str(), "tile_state_info"})
      .compile_program({ptx_args, num_ptx_args})
      .cleanup_program()
      .finalize_program(num_ptx_lto_args, ptx_lopts);
  auto ptx_code = compile_result.cubin.get();
  status_size   = scan::find_size_t(ptx_code, "status_size");
}
```
important: the assumption that a storage type leads to a particular tile state specialization is exactly what we tried to avoid with PTX queries. Later on, we'll have a decoupled look-back optimization that lets small, trivial user-defined types still fit into a single architectural word. Once that optimization is merged, the resulting bug in cuda.parallel would be hard to track down. I think we can add another level of indirection. For each scan tile state specialization, we can add two static member variables. One specialization would have:

```cpp
static constexpr size_t description_bytes_per_tile = sizeof(TxnWord);
static constexpr size_t payload_bytes_per_tile     = 0;
```

and the other would be:

```cpp
static constexpr size_t description_bytes_per_tile = sizeof(StatusWord);
static constexpr size_t payload_bytes_per_tile     = sizeof(Uninitialized<T>);
```

Then we can add a detail allocation-size query:

```cpp
_CCCL_HOST_DEVICE _CCCL_FORCEINLINE constexpr int num_tiles_to_num_tile_states(int num_tiles)
{
  return CUB_PTX_WARP_THREADS + num_tiles;
}

_CCCL_HOST_DEVICE _CCCL_FORCEINLINE size_t
allocation_size(int bytes_per_description, int bytes_per_payload, int num_tiles)
{
  // Specify storage allocation requirements
  size_t allocation_sizes[3];
  int num_tile_states = num_tiles_to_num_tile_states(num_tiles);

  // bytes needed for tile status descriptors
  allocation_sizes[0] = num_tile_states * bytes_per_description;

  // bytes needed for partials
  allocation_sizes[1] = num_tile_states * bytes_per_payload;

  // bytes needed for inclusives
  allocation_sizes[2] = num_tile_states * bytes_per_payload;

  // Set the necessary size of the blob
  size_t temp_storage_bytes = 0;
  void* allocations[3]      = {};
  AliasTemporaries(nullptr, temp_storage_bytes, allocations, allocation_sizes);
  return temp_storage_bytes;
}
```

That should work in both cases as:

```cpp
_CCCL_HOST_DEVICE _CCCL_FORCEINLINE static constexpr cudaError_t
AllocationSize(int num_tiles, size_t& temp_storage_bytes)
{
  temp_storage_bytes = detail::allocation_size(description_bytes_per_tile, payload_bytes_per_tile, num_tiles);
  return cudaSuccess;
}
```

Regarding the highlighted code, we'd end up with just one branch that doesn't assume much about the code:

```cpp
"__device__ size_t description_size = cub::ScanTileState<{2}>::description_bytes_per_tile;\n"
"__device__ size_t payload_size = cub::ScanTileState<{2}>::payload_bytes_per_tile;\n",
```

This would allow us to remove the `is_primitive` member variable and simplify the C Parallel tile state wrapper. The only thing to note is that PTX global variables are zero-initialized by default. Since the payload bytes will equal zero in one of the cases, the PTX might lose the initializer, and `scan::find_size_t` will stop working. To work around that, I'd suggest adding another `find_size_t(char* ptx, std::string_view name, size_t default_value)` overload, and searching for the payload variable as `scan::find_size_t(ptx, "payload_size", 0)`.
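The suggested defaulted lookup could be sketched roughly as below. This is a hedged illustration only: `find_initializer` and its toy PTX-text parsing are hypothetical stand-ins for whatever `scan::find_size_t` actually does in the PR; the point is merely the fallback when a zero-initialized global loses its PTX initializer.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Toy lookup of a size_t global's initializer in PTX text. Initialized globals
// appear roughly as:
//   .global .align 8 .b8 description_size[8] = {16, 0, 0, 0, 0, 0, 0, 0};
// while a zero-initialized global may carry no "= {...}" initializer at all.
std::optional<std::size_t> find_initializer(std::string_view ptx, std::string_view name)
{
  auto pos = ptx.find(name);
  if (pos == std::string_view::npos)
  {
    return std::nullopt; // variable not present in the PTX
  }
  auto semi = ptx.find(';', pos);
  auto eq   = ptx.find('=', pos);
  if (eq == std::string_view::npos || eq > semi)
  {
    return std::nullopt; // declaration has no initializer
  }
  // Little-endian byte list; for small values the first byte suffices here.
  auto open  = ptx.find('{', eq);
  auto comma = ptx.find(',', open);
  return std::stoul(std::string(ptx.substr(open + 1, comma - open - 1)));
}

// The proposed overload: fall back to a caller-supplied default when the
// initializer was dropped (e.g. payload_bytes_per_tile == 0).
std::size_t find_size_t(std::string_view ptx, std::string_view name, std::size_t fallback)
{
  return find_initializer(ptx, name).value_or(fallback);
}
```

With this shape, the caller can query `find_size_t(ptx, "payload_size", 0)` and get a well-defined zero even when the PTX omits the initializer.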
/ok to test
🟨 CI finished in 1h 30m: Pass: 98%/89 | Total: 2d 14h | Avg: 42m 19s | Max: 1h 11m | Hits: 170%/10936

| Project | Modified | Project or dependencies modified |
|---|---|---|
| CCCL Infrastructure | | |
| libcu++ | | |
| CUB | +/- | +/- |
| Thrust | | +/- |
| CUDA Experimental | | |
| python | | +/- |
| CCCL C Parallel Library | +/- | +/- |
| Catch2Helper | | +/- |

🏃 Runner counts (total jobs: 89)

| # | Runner |
|---|---|
| 65 | linux-amd64-cpu16 |
| 8 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
Force-pushed fd4bdbd to 3107a7f
Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>
Force-pushed 9f4468f to 9fe8dfe
🟩 CI finished in 1h 34m: Pass: 100%/90 | Total: 2d 14h | Avg: 41m 24s | Max: 1h 21m | Hits: 70%/129364

| Project | Modified | Project or dependencies modified |
|---|---|---|
| CCCL Infrastructure | | |
| libcu++ | | |
| CUB | +/- | +/- |
| Thrust | | +/- |
| CUDA Experimental | | |
| python | | +/- |
| CCCL C Parallel Library | +/- | +/- |
| Catch2Helper | | +/- |

🏃 Runner counts (total jobs: 90)

| # | Runner |
|---|---|
| 65 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
🟩 CI finished in 2h 58m: Pass: 100%/90 | Total: 2d 13h | Avg: 41m 10s | Max: 1h 20m | Hits: 71%/129364

(project and runner breakdown identical to the previous run)
Description

Closes #2544

This PR adds the `c.parallel` scan API. Reviewers please note: it's easiest to review this PR commit-by-commit. The largest and most significant commit is "Add scan c.parallel API". The commits prior to that are largely preparatory work (but still require review).

Checklist