Implement cuda::isqrt
#4427
_Up __current{0};
_Up __next{_Up(_Up{1} << ((_CUDA_VSTD::bit_width(_Up(__v - 1)) + 1) / 2))};

do
{
  __current = __next;
  __next = _Up((__current + _Up(__v) / __current) / 2);
} while (__next < __current);

return static_cast<_Tp>(__current);
}
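For readers who want to try the iteration quoted above outside libcu++, here is a minimal standalone sketch for 32-bit unsigned inputs. The names `isqrt_newton` and `bit_width_u32` are hypothetical (the latter stands in for `_CUDA_VSTD::bit_width`), and the early return for inputs below 2 is an assumption about what the elided code before the hunk does. It is not the PR's implementation.

```cpp
#include <cstdint>

// Hypothetical stand-in for _CUDA_VSTD::bit_width: number of bits needed to represent x.
constexpr unsigned bit_width_u32(std::uint32_t x) noexcept
{
  unsigned n = 0;
  while (x != 0)
  {
    ++n;
    x >>= 1;
  }
  return n;
}

// Hypothetical name; a sketch of the Newton (Heron) iteration from the quoted hunk.
constexpr std::uint32_t isqrt_newton(std::uint32_t v) noexcept
{
  // Assumption: small inputs are handled up front, which also keeps v - 1 from wrapping.
  if (v < 2)
  {
    return v;
  }

  // Initial guess: a power of two that is >= sqrt(v).
  const unsigned shift = (bit_width_u32(v - 1) + 1u) / 2u;
  std::uint32_t next    = std::uint32_t{1} << shift;
  std::uint32_t current = 0;

  // Starting from an overestimate, the iterates decrease until they reach floor(sqrt(v)).
  do
  {
    current = next;
    next    = (current + v / current) / 2;
  } while (next < current);

  return current;
}
```

Each loop iteration performs an integer division, which is the cost the floating-point and digit-by-digit alternatives discussed in this thread try to avoid.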
We could also just cast to the respective floating point and then cast back?
That's tricky. Conversion to `float` is worth it only if `--prec-sqrt=false` is used. On the other hand, this cannot be detected. Need to check with the compiler team.
Two other notes: the PTX `sqrt` instruction can be used to select the fast-math (approx) mode directly. We need to be very careful with numbers >= 2^23 because of the loss of precision in the conversion to floating point.
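To make the precision concern concrete, the sketch below uses a host-side `std::sqrt` on `float` and then corrects the result with exact integer arithmetic. The name `isqrt_via_float` is hypothetical, and the fix-up loops are one possible way to compensate for rounding, not something proposed in this thread. Single-precision `float` has a 24-bit significand, so large integers can change value during the conversion.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical name; a sketch of "cast to floating point and cast back" with a correction step.
inline std::uint32_t isqrt_via_float(std::uint32_t v) noexcept
{
  // The conversion may round v, so the truncated root is only an estimate.
  std::uint32_t r = static_cast<std::uint32_t>(std::sqrt(static_cast<float>(v)));

  // Correct the estimate downwards, comparing in 64 bits so that r * r cannot overflow.
  while (static_cast<std::uint64_t>(r) * r > v)
  {
    --r;
  }
  // Correct the estimate upwards for the case where the estimate was too small.
  while (static_cast<std::uint64_t>(r + 1) * (r + 1) <= v)
  {
    ++r;
  }
  return r;
}
```

For example, converting the largest 32-bit value to `float` rounds it up to 2^32, whose square root is 65536; the first loop brings the result back to 65535.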
I've implemented the digit-by-digit algorithm, which uses only bit shifts and +/- operations and takes at most `nbits / 2 - 1` steps. Check it out!
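For reference, a textbook form of the digit-by-digit (binary) square root looks roughly like the sketch below. The name `isqrt_digit_by_digit` is hypothetical, and this version makes no attempt to match the step count or the exact structure of the code in the PR.

```cpp
#include <cstdint>

// Hypothetical name; a textbook binary digit-by-digit square root for 32-bit inputs.
constexpr std::uint32_t isqrt_digit_by_digit(std::uint32_t v) noexcept
{
  std::uint32_t result = 0;
  // Start at the highest power of four that fits in 32 bits.
  std::uint32_t bit = std::uint32_t{1} << 30;

  // Skip powers of four that are larger than the input.
  while (bit > v)
  {
    bit >>= 2;
  }

  // Decide one result bit per iteration using only shifts, additions and subtractions.
  while (bit != 0)
  {
    if (v >= result + bit)
    {
      v -= result + bit;
      result = (result >> 1) + bit;
    }
    else
    {
      result >>= 1;
    }
    bit >>= 2;
  }
  return result;
}
```

Unlike the Newton iteration, this loop contains no division, and its trip count is bounded by half the bit width of the type.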
/ok to test 8730622
🟩 CI finished in 2h 22m: Pass: 100%/170 | Total: 3d 16h | Avg: 31m 18s | Max: 1h 37m | Hits: 69%/269901

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |
| | stdpar |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | stdpar |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 170)

| # | Runner |
|---|---|
| 121 | linux-amd64-cpu16 |
| 15 | windows-amd64-cpu16 |
| 12 | linux-arm64-cpu16 |
| 8 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 5 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
}

_Up __current{0};
_Up __next{_Up(_Up{1} << ((_CUDA_VSTD::bit_width(_Up(__v - 1)) + 1) / 2))};
Uniform initialization with dynamic values is a GCC extension. Also, I would prefer `static_cast` over a C-style cast.
`bit_width` returns an `int`. We should convert it to `unsigned` to make the division more efficient.
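The reasoning behind this suggestion can be illustrated with the small sketch below. Signed division by 2 has to round toward zero, so it is not a plain shift for negative values, while unsigned division by 2 is. All names here are hypothetical, and whether a given compiler already proves that the `bit_width` result is non-negative is an assumption that would need checking.

```cpp
// Hypothetical helpers, for illustration only.
constexpr int half_signed(int x) noexcept
{
  return x / 2; // must round toward zero, e.g. -3 / 2 == -1
}

constexpr unsigned half_unsigned(unsigned x) noexcept
{
  return x / 2u; // equivalent to x >> 1
}

static_assert(half_signed(-3) == -1, "signed division truncates toward zero");
static_assert(half_unsigned(3u) == (3u >> 1), "unsigned division by 2 is a plain shift");

// Shift amount for the initial guess, computed in unsigned arithmetic.
// The cast is value-preserving because a bit width is never negative.
constexpr unsigned initial_shift(int width) noexcept
{
  return (static_cast<unsigned>(width) + 1u) / 2u;
}
```

With this shape, the width returned by `bit_width` is cast once and the rest of the expression stays in unsigned arithmetic.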
Please also check the developer forum discussion: https://forums.developer.nvidia.com/t/integer-square-root/198642
/ok to test 9ed5906
🟨 CI finished in 2h 04m: Pass: 74%/170 | Total: 3d 17h | Avg: 31m 25s | Max: 1h 42m | Hits: 49%/153708
/ok to test 6fd7821
🟩 CI finished in 2h 26m: Pass: 100%/170 | Total: 3d 18h | Avg: 32m 05s | Max: 1h 34m | Hits: 69%/269901
This PR introduces the `cuda::isqrt` function, which computes the integer square root of a given input. The implementation is based on the reference implementation from the P3605R0 proposal; I am a bit unsure whether I am able to reuse it.