Optimize bit_floor, bit_ceil, bit_width #3296
Co-authored-by: Wesley Maxey
🟨 CI finished in 1h 14m: Pass: 43%/158 | Total: 2d 04h | Avg: 20m 03s | Max: 1h 11m | Hits: 61%/87192
| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 158)
# | Runner |
---|---|
111 | linux-amd64-cpu16 |
15 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
8 | linux-amd64-gpu-rtx2080-latest-1 |
6 | linux-amd64-gpu-rtxa6000-latest-1 |
5 | linux-amd64-gpu-h100-latest-1 |
3 | linux-amd64-gpu-rtx4090-latest-1 |
🟨 CI finished in 1h 36m: Pass: 97%/158 | Total: 3d 03h | Avg: 28m 31s | Max: 1h 26m | Hits: 70%/237980
🟩 CI finished in 1h 35m: Pass: 100%/158 | Total: 3d 01h | Avg: 27m 56s | Max: 1h 17m | Hits: 73%/248320
// #include <cuda/__ptx/instructions/shl.h>
// #include <cuda/__ptx/instructions/shr.h>
Please remove commented out includes
This PR is marked blocked because it depends on these two instructions.
+ (numeric_limits<unsigned>::digits - numeric_limits<_Tp>::digits)))
>> (numeric_limits<unsigned>::digits - numeric_limits<_Tp>::digits));
// if __t == 0, __bit_log2(0) returns 0xFFFFFFFF. Since unsigned overflow is well-defined, the result is -1 + 1 = 0
auto __ret = _CUDA_VSTD::__bit_log2(__t) + 1;
The comment and the code do not agree. __bit_log2 returns an int, so this would be signed overflow, aka UB.
Please add the appropriate casts if you want to cast to unsigned for the intermediate computation.
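To illustrate the kind of cast being asked for, here is a minimal host-side sketch. `bit_log2_sketch` and `bit_width_sketch` are hypothetical stand-ins, not the PR's `__bit_log2` or any libcu++ code, and the sketch assumes the log2 helper returns an `int` that is -1 for a zero input. Converting to unsigned before adding 1 makes the wraparound described in the code comment happen in unsigned arithmetic, where it is well defined.

```cpp
#include <cstdint>

// Hypothetical stand-in for __bit_log2 (assumption): index of the highest
// set bit, or -1 when the input is zero.
constexpr int bit_log2_sketch(std::uint32_t t) noexcept
{
  int result = -1;
  while (t != 0)
  {
    t >>= 1;
    ++result;
  }
  return result;
}

// bit_width computed as log2 + 1, with the addition done in unsigned
// arithmetic: for t == 0 the int -1 converts to 0xFFFFFFFF, and
// 0xFFFFFFFF + 1 wraps to 0 by definition of unsigned arithmetic.
constexpr std::uint32_t bit_width_sketch(std::uint32_t t) noexcept
{
  return static_cast<std::uint32_t>(bit_log2_sketch(t)) + 1u;
}
```

With this version the zero case no longer depends on how the reader interprets the signedness of the intermediate value: `bit_width_sketch(0)` is 0 and `bit_width_sketch(255)` is 8.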
if (!_CUDA_VSTD::__cccl_default_is_constant_evaluated() && sizeof(_Tp) <= 8 && false)
{
// CUDA right shift (ptx::shr) returns 0 if the right operand is larger than the number of bits of the type
// The result is computed as max(1, bit_width(__t - 1)) because it is more efficient than ternary operator
Please file a backend bug for that
I don't think it is a bug. Recent GPU archs provide MNMX instructions to compute minimum and maximum efficiently. The ternary operator has different semantics. I don't think the compiler can understand the program logic well enough to exploit this optimization.
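The two formulations under discussion can be compared in a small host-side sketch. Everything here is an assumption for illustration: `shl_clamped` is a hypothetical helper that emulates on the host a left shift yielding 0 once the amount reaches the bit width (the behavior the code comment attributes to the CUDA shift), and `bit_width_u32`, `bit_ceil_ternary`, and `bit_ceil_max` are not the PR's functions. Both variants assume the result is representable, i.e. the input is at most 2^31.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical host emulation of a shift that yields 0 when the amount is
// at least the bit width. A plain C++ shift by 32 on a 32-bit type is UB.
constexpr std::uint32_t shl_clamped(std::uint32_t value, std::uint32_t amount) noexcept
{
  return amount >= 32u ? 0u : value << amount;
}

// Number of bits needed to represent t (0 for t == 0).
constexpr std::uint32_t bit_width_u32(std::uint32_t t) noexcept
{
  std::uint32_t width = 0;
  while (t != 0)
  {
    t >>= 1;
    ++width;
  }
  return width;
}

// Variant 1: explicit branch for the t <= 1 case.
constexpr std::uint32_t bit_ceil_ternary(std::uint32_t t) noexcept
{
  return t <= 1u ? 1u : 1u << bit_width_u32(t - 1u);
}

// Variant 2: no branch on t. For t == 0, t - 1 wraps to 0xFFFFFFFF, the
// width is 32, the clamped shift gives 0, and max lifts the result to 1.
constexpr std::uint32_t bit_ceil_max(std::uint32_t t) noexcept
{
  return std::max<std::uint32_t>(1u, shl_clamped(1u, bit_width_u32(t - 1u)));
}
```

Both variants return the same values over the valid input range; the second one only works because the shift is defined for an out-of-range amount, which is why it is tied to the device path in the hunk above.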
🟩 CI finished in 1h 30m: Pass: 100%/158 | Total: 3d 01h | Avg: 27m 54s | Max: 1h 18m | Hits: 76%/248762
🟩 CI finished in 1h 33m: Pass: 100%/158 | Total: 3d 00h | Avg: 27m 27s | Max: 1h 18m | Hits: 76%/248762
Fixes #2239
Description
Optimize bit_floor, bit_ceil, bit_width
Features:
- nodiscard, noexcept
- bit_ceil
- bfind and relying on the shift behavior when the amount is larger than the number of bits (CUDA)

Requires: #3414