Optimize, Cleanup, and Expose CUB Thread-Level Reduction #2390
Conversation
Please stay with the current optimizations, and create individual PRs for follow-up changes. Especially with performance improvements, it is important to be able to bisect small commits rather than one large "optimizes the world" commit.
🟨 CI finished in 5h 09m: Pass: 81%/259 | Total: 5d 21h | Avg: 32m 39s | Max: 1h 16m | Hits: 62%/21490

Modifications in project?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 259)

| # | Runner |
|---|---|
| 186 | linux-amd64-cpu16 |
| 42 | linux-amd64-gpu-v100-latest-1 |
| 16 | linux-arm64-cpu16 |
| 15 | windows-amd64-cpu16 |
🟩 CI finished in 2h 16m: Pass: 100%/208 | Total: 6d 01h | Avg: 41m 50s | Max: 1h 24m | Hits: 43%/14070

Modifications in project?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 208)

| # | Runner |
|---|---|
| 171 | linux-amd64-cpu16 |
| 16 | linux-arm64-cpu16 |
| 12 | linux-amd64-gpu-v100-latest-1 |
| 9 | windows-amd64-cpu16 |
🟩 CI finished in 2h 17m: Pass: 100%/208 | Total: 5d 22h | Avg: 41m 05s | Max: 1h 23m | Hits: 54%/14070

Modifications in project?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 208)

| # | Runner |
|---|---|
| 171 | linux-amd64-cpu16 |
| 16 | linux-arm64-cpu16 |
| 12 | linux-amd64-gpu-v100-latest-1 |
| 9 | windows-amd64-cpu16 |
🟩 CI finished in 1h 51m: Pass: 100%/208 | Total: 5d 22h | Avg: 41m 08s | Max: 1h 16m | Hits: 54%/14070

Modifications in project?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 208)

| # | Runner |
|---|---|
| 171 | linux-amd64-cpu16 |
| 16 | linux-arm64-cpu16 |
| 12 | linux-amd64-gpu-v100-latest-1 |
| 9 | windows-amd64-cpu16 |
🟨 CI finished in 1h 50m: Pass: 93%/364 | Total: 6d 04h | Avg: 24m 28s | Max: 1h 13m | Hits: 77%/22755

Modifications in project?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 364)

| # | Runner |
|---|---|
| 297 | linux-amd64-cpu16 |
| 28 | linux-arm64-cpu16 |
| 24 | linux-amd64-gpu-v100-latest-1 |
| 15 | windows-amd64-cpu16 |
🟨 CI finished in 2h 18m: Pass: 93%/364 | Total: 5d 17h | Avg: 22m 40s | Max: 1h 03m | Hits: 85%/22755

Modifications in project?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 364)

| # | Runner |
|---|---|
| 297 | linux-amd64-cpu16 |
| 28 | linux-arm64-cpu16 |
| 24 | linux-amd64-gpu-v100-latest-1 |
| 15 | windows-amd64-cpu16 |
🟨 CI finished in 1h 29m: Pass: 98%/364 | Total: 6d 07h | Avg: 24m 54s | Max: 1h 12m | Hits: 85%/22755

Modifications in project?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 364)

| # | Runner |
|---|---|
| 297 | linux-amd64-cpu16 |
| 28 | linux-arm64-cpu16 |
| 24 | linux-amd64-gpu-v100-latest-1 |
| 15 | windows-amd64-cpu16 |
🟩 CI finished in 50m 43s: Pass: 100%/364 | Total: 1d 12h | Avg: 5m 57s | Max: 37m 56s | Hits: 99%/25679

Modifications in project?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 364)

| # | Runner |
|---|---|
| 297 | linux-amd64-cpu16 |
| 28 | linux-arm64-cpu16 |
| 24 | linux-amd64-gpu-v100-latest-1 |
| 15 | windows-amd64-cpu16 |
🟩 CI finished in 1h 29m: Pass: 100%/364 | Total: 1d 12h | Avg: 6m 04s | Max: 54m 37s | Hits: 94%/25679

Modifications in project?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CUDA C Core Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CUDA C Core Library |

🏃 Runner counts (total jobs: 364)

| # | Runner |
|---|---|
| 297 | linux-amd64-cpu16 |
| 28 | linux-arm64-cpu16 |
| 24 | linux-amd64-gpu-v100-latest-1 |
| 15 | windows-amd64-cpu16 |
🟩 CI finished in 1h 50m: Pass: 100%/366 | Total: 6d 16h | Avg: 26m 17s | Max: 1h 16m | Hits: 11%/27881

Modifications in project?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | pycuda |
| | CCCL C Parallel Library |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | pycuda |
| +/- | CCCL C Parallel Library |

🏃 Runner counts (total jobs: 366)

| # | Runner |
|---|---|
| 298 | linux-amd64-cpu16 |
| 28 | linux-arm64-cpu16 |
| 25 | linux-amd64-gpu-v100-latest-1 |
| 15 | windows-amd64-cpu16 |
cub/cub/thread/thread_reduce.cuh (Outdated)
```cpp
return (is_one_of<T, ::cuda::std::int32_t, ::cuda::std::uint32_t, ::cuda::std::int64_t, ::cuda::std::uint64_t>()
        && is_one_of<ReductionOp, cub::Min, cub::Max, cub::Sum, cub::BitAnd, cub::BitOr, cub::BitXor>())
```
We should move that into a proper variable template or struct.
uhmm, not sure what you are referring to. It looks pretty readable to me
does the following solution make sense?
```cpp
template <typename T, typename ReductionOp>
struct enable_ternary_reduction_sm90
{
  static constexpr bool value =
    cub::detail::is_one_of<T, ::cuda::std::int32_t, ::cuda::std::uint32_t, ::cuda::std::int64_t, ::cuda::std::uint64_t>()
    && cub::detail::is_one_of<ReductionOp, cub::Min, cub::Max, cub::Sum, cub::BitAnd, cub::BitOr, cub::BitXor>();
};

#  if defined(_CCCL_HAS_NVFP16)
template <typename ReductionOp>
struct enable_ternary_reduction_sm90<__half2, ReductionOp>
{
  static constexpr bool value =
    cub::detail::is_one_of<ReductionOp, cub::Min, cub::Max, SimdMin<__half>, SimdMax<__half>>();
};
#  endif // defined(_CCCL_HAS_NVFP16)

#  if defined(_CCCL_HAS_NVBF16)
template <typename ReductionOp>
struct enable_ternary_reduction_sm90<__nv_bfloat162, ReductionOp>
{
  static constexpr bool value =
    cub::detail::is_one_of<ReductionOp, cub::Min, cub::Max, SimdMin<__nv_bfloat16>, SimdMax<__nv_bfloat16>>();
};
#  endif // defined(_CCCL_HAS_NVBF16)
```
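Taking the "proper variable template" suggestion literally, the trait could additionally be surfaced as a variable template so call sites avoid spelling out `::value`. A minimal sketch; the `_v` name and the dispatch-site usage are hypothetical, not from the PR:

```cpp
// Hypothetical *_v convenience alias (sketch, not the PR's actual code).
template <typename T, typename ReductionOp>
constexpr bool enable_ternary_reduction_sm90_v =
  enable_ternary_reduction_sm90<T, ReductionOp>::value;

// Hypothetical dispatch-site usage:
//   if constexpr (enable_ternary_reduction_sm90_v<T, ReductionOp>) { ... }
```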
cub/cub/thread/thread_reduce.cuh (Outdated)
```cpp
return ((is_one_of<T, ::cuda::std::int16_t, ::cuda::std::uint16_t>() && is_one_of<ReductionOp, cub::Min, cub::Max>())
#  if defined(_CCCL_HAS_NVFP16)
        || (::cuda::std::is_same<T, __half>::value && is_one_of<ReductionOp, cub::Min, cub::Max, cub::Sum, cub::Mul>())
#  endif
#  if defined(_CCCL_HAS_NVBF16)
        || (::cuda::std::is_same<T, __nv_bfloat16>::value
            && is_one_of<ReductionOp, cub::Min, cub::Max, cub::Sum, cub::Mul>())
#  endif
```
I find this technique really hard to read. Why are we not using a struct that we specialize for half and friends?
Sorry, I don't exactly understand what you are proposing. Are you concerned by the `#if` defines? Do you suggest breaking this function into multiple template structs with specializations for different types?
what do you think?
```cpp
template <typename T, typename ReductionOp>
struct enable_generic_simd_reduction_traits
{
  static constexpr bool value = cub::detail::is_one_of<T, ::cuda::std::int16_t, ::cuda::std::uint16_t>()
                             && cub::detail::is_one_of<ReductionOp, cub::Min, cub::Max>();
};

#  if defined(_CCCL_HAS_NVFP16)
template <typename ReductionOp>
struct enable_generic_simd_reduction_traits<__half, ReductionOp>
{
  static constexpr bool value = cub::detail::is_one_of<ReductionOp, cub::Min, cub::Max, cub::Sum, cub::Mul>();
};
#  endif // defined(_CCCL_HAS_NVFP16)

#  if defined(_CCCL_HAS_NVBF16)
template <typename ReductionOp>
struct enable_generic_simd_reduction_traits<__nv_bfloat16, ReductionOp>
{
  static constexpr bool value = cub::detail::is_one_of<ReductionOp, cub::Min, cub::Max, cub::Sum, cub::Mul>();
};
#  endif // defined(_CCCL_HAS_NVBF16)
```
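For context, `is_one_of` above is a constexpr membership test over a type list. A minimal self-contained sketch of such a helper, as an assumption about its shape rather than CUB's exact definition:

```cpp
#include <cuda/std/type_traits>

// Sketch of an is_one_of helper: true iff T is one of Ts.
// (Assumption: mirrors the spirit of cub::detail::is_one_of,
// not its exact definition in <cub/thread/thread_reduce.cuh>.)
template <typename T, typename... Ts>
constexpr bool is_one_of()
{
  return (::cuda::std::is_same<T, Ts>::value || ...);
}

static_assert(is_one_of<int, short, int, long>(), "int is in the list");
static_assert(!is_one_of<float, short, int, long>(), "float is not");
```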
Yeah
Force-pushed from 39ba5f5 to 9616009.
Addresses #2287: Optimize, Cleanup, and Expose CUB Thread-Level Reduction

- `half` and `bfloat16` support
- `ThreadReduce` exposed in the `cub::` namespace
- `catch2` test cases for all new cases
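To make the last bullets concrete, here is a minimal device-side sketch of exercising the newly exposed entry point. The kernel scaffolding and the exact `cub::ThreadReduce` signature are assumptions for illustration; check the merged header for the real interface.

```cpp
#include <cub/thread/thread_reduce.cuh>

#include <cuda/std/functional>

// Illustrative kernel (assumption: cub::ThreadReduce takes a reference to a
// thread-private array plus a binary operator; verify against the merged
// header). Each thread reduces its own 4 items into one output value.
__global__ void thread_reduce_demo(const int* in, int* out)
{
  int items[4];
  for (int i = 0; i < 4; ++i)
  {
    items[i] = in[threadIdx.x * 4 + i];
  }
  out[threadIdx.x] = cub::ThreadReduce(items, ::cuda::std::plus<>{});
}
```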