Add b200 tunings for scan.exclusive.sum #3559
bernhardmgruber commented Jan 28, 2025
- Perf diff for scan on B200 before and after this PR
24683b8 to 7c50dcc
🟨 CI finished in 4h 07m: Pass: 96%/90 | Total: 2d 14h | Avg: 41m 55s | Max: 1h 12m | Hits: 262%/10928
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 90)
# | Runner |
---|---|
65 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
9 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
7c50dcc to 2e2caa7
🟩 CI finished in 2h 30m: Pass: 100%/89 | Total: 15h 26m | Avg: 10m 24s | Max: 57m 52s | Hits: 422%/10928
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 89)
# | Runner |
---|---|
65 | linux-amd64-cpu16 |
11 | linux-amd64-gpu-v100-latest-1 |
8 | windows-amd64-cpu16 |
4 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
2e2caa7 to 3f68101
🟨 CI finished in 1h 39m: Pass: 98%/90 | Total: 2d 16h | Avg: 42m 41s | Max: 1h 24m | Hits: 248%/13398
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 90)
# | Runner |
---|---|
65 | linux-amd64-cpu16 |
9 | windows-amd64-cpu16 |
6 | linux-amd64-gpu-rtxa6000-latest-1 |
4 | linux-arm64-cpu16 |
3 | linux-amd64-gpu-rtx4090-latest-1 |
2 | linux-amd64-gpu-rtx2080-latest-1 |
1 | linux-amd64-gpu-h100-latest-1 |
3f68101 to 65ef2c5
I changed the tuning selection logic to be more akin to what the benchmark does. @gevtushenko I would like your review here. I remember we discussed this at some point and you had a story about why AccumT was the right thing to check here, but I think we actually need to check both AccumT and ValueT.
    // Only consider sm100 tunings if the accumulator size matches the one we use in the benchmarks
    using benchmark_accum_t = ::cuda::std::__accumulator_t<ScanOpT, ValueT, ValueT>;
    static constexpr bool accum_size_match = classify_accum_size<AccumT>() == classify_accum_size<benchmark_accum_t>();

    using ScanPolicyT = ::cuda::std::conditional_t<
      accum_size_match,
      decltype(select_agent_policy100<sm100_tuning<ValueT, AccumT, OffsetT, classify_op<ScanOpT>()>>(0)),
      typename Policy900::ScanPolicyT>;
Here, we check whether the `AccumT` matches what we would use in the benchmark. If it does, we take a `sm100_tuning`, otherwise we fall back to whatever `Policy900` did.
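The selection described above can be sketched in plain C++ without any CUDA headers. This is only an illustration under stated assumptions: `policy_sm90`, `policy_sm100`, `sum_op`, `select_scan_policy`, and this simplified `classify_accum_size` are hypothetical stand-ins, and the benchmark accumulator type is approximated as the decayed result of calling the operator on two `ValueT` arguments rather than through `::cuda::std::__accumulator_t`.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical size classes; the real enumeration in CUB may differ.
enum class accum_size { _1, _2, _4, _8, _16, unknown };

// Map a type to its size class (illustrative reimplementation).
template <class T>
constexpr accum_size classify_accum_size()
{
  return sizeof(T) == 1  ? accum_size::_1
       : sizeof(T) == 2  ? accum_size::_2
       : sizeof(T) == 4  ? accum_size::_4
       : sizeof(T) == 8  ? accum_size::_8
       : sizeof(T) == 16 ? accum_size::_16
                         : accum_size::unknown;
}

// Placeholder policies standing in for the tuned SM100 policy and the SM90 fallback.
struct policy_sm90 {};
struct policy_sm100 {};

// Hypothetical scan operator whose result type equals its argument type.
struct sum_op
{
  template <class T>
  constexpr T operator()(T a, T b) const
  {
    return a + b;
  }
};

template <class ValueT, class AccumT, class ScanOpT>
struct select_scan_policy
{
  // Assumption: the benchmark accumulates in the type that op(ValueT, ValueT) yields.
  using benchmark_accum_t =
    std::decay_t<decltype(std::declval<ScanOpT>()(std::declval<ValueT>(), std::declval<ValueT>()))>;

  // Use the tuned policy only if the actual accumulator has the benchmarked size.
  static constexpr bool accum_size_match =
    classify_accum_size<AccumT>() == classify_accum_size<benchmark_accum_t>();

  using type = std::conditional_t<accum_size_match, policy_sm100, policy_sm90>;
};
```

In this sketch, scanning `int` values with an `int` or `float` accumulator picks the tuned policy because the size class matches, while a `long long` accumulator falls back to the SM90 policy.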
    template <class ValueT, class AccumT, class OffsetT>
    struct sm100_tuning<ValueT,
                        AccumT,
                        OffsetT,
                        op_type::plus,
                        primitive_value::yes,
                        primitive_accum::yes,
                        offset_size::_4,
                        value_size::_1>
Compared with the sm90 tunings, we switch on the `value_size` here, not the `accum_size`, because that is what we actually iterate over in the benchmark. We do not check the size of `AccumT` here, because we already checked previously whether that size corresponds to the size we would have in the benchmark.
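A minimal sketch of how such a specialization can be keyed on the value size follows. The enumerations, the classifier helpers, the defaulted template parameters, and the `tuned` flag are assumptions made for illustration; the real `sm100_tuning` carries tuning parameters instead of a flag, and its primary template may be declared differently.

```cpp
#include <cstdint>
#include <type_traits>

enum class op_type { plus, unknown };
enum class primitive_value { no, yes };
enum class primitive_accum { no, yes };
enum class offset_size { _4, _8, unknown };
enum class value_size { _1, _2, _4, _8, _16, unknown };

// Illustrative classifiers (hypothetical reimplementations).
template <class T>
constexpr value_size classify_value_size()
{
  return sizeof(T) == 1  ? value_size::_1
       : sizeof(T) == 2  ? value_size::_2
       : sizeof(T) == 4  ? value_size::_4
       : sizeof(T) == 8  ? value_size::_8
       : sizeof(T) == 16 ? value_size::_16
                         : value_size::unknown;
}

template <class T>
constexpr offset_size classify_offset_size()
{
  return sizeof(T) == 4 ? offset_size::_4 : sizeof(T) == 8 ? offset_size::_8 : offset_size::unknown;
}

template <class T>
constexpr primitive_value is_primitive_value()
{
  return std::is_arithmetic<T>::value ? primitive_value::yes : primitive_value::no;
}

template <class T>
constexpr primitive_accum is_primitive_accum()
{
  return std::is_arithmetic<T>::value ? primitive_accum::yes : primitive_accum::no;
}

// Primary template: no tuning available for this combination.
template <class ValueT,
          class AccumT,
          class OffsetT,
          op_type Op,
          primitive_value PrimValue = is_primitive_value<ValueT>(),
          primitive_accum PrimAccum = is_primitive_accum<AccumT>(),
          offset_size OffsetSize    = classify_offset_size<OffsetT>(),
          value_size ValueSize      = classify_value_size<ValueT>()>
struct sm100_tuning
{
  static constexpr bool tuned = false;
};

// Specialization keyed on the size of ValueT; the size of AccumT is not inspected here.
template <class ValueT, class AccumT, class OffsetT>
struct sm100_tuning<ValueT,
                    AccumT,
                    OffsetT,
                    op_type::plus,
                    primitive_value::yes,
                    primitive_accum::yes,
                    offset_size::_4,
                    value_size::_1>
{
  static constexpr bool tuned = true;
};
```

With these definitions, a 1-byte value type with a 4-byte offset matches the specialization regardless of the accumulator's size, which is why the accumulator size has to be validated separately before the tuning is consulted.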
2971ed6 to de36890
Co-authored-by: Georgii Evtushenko
After discussion with Georgii
I diffed the SASS for SM100 from the commit on which @gonidelis did his benchmark against the tip of this PR, including all my changes to the tuning selection logic, and nothing changed except kernel symbol names. I therefore conclude that @gonidelis's benchmark is still valid.
8502137 to 622450c
I dropped the max tunings because max is not an operator known to CUB. See clarification here: #3709
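To illustrate why a tuning keyed on an operator the library cannot recognize would never be selected, here is a hedged sketch of an operator classifier. It uses `std::plus` as a stand-in for the library's plus functor, and `is_plus`, `user_max`, and this `classify_op` are hypothetical names; the actual CUB classification may be implemented differently.

```cpp
#include <functional>
#include <type_traits>

enum class op_type { plus, unknown };

// Only the plus functor is recognized in this sketch.
template <class Op>
struct is_plus : std::false_type
{};

// Matches std::plus<int>, std::plus<float>, and the transparent std::plus<void>.
template <class T>
struct is_plus<std::plus<T>> : std::true_type
{};

template <class Op>
constexpr op_type classify_op()
{
  return is_plus<Op>::value ? op_type::plus : op_type::unknown;
}

// A user-defined maximum: the classifier has no way to know what it computes.
struct user_max
{
  template <class T>
  constexpr T operator()(T a, T b) const
  {
    return a < b ? b : a;
  }
};
```

Because `classify_op<user_max>()` yields `op_type::unknown` here, a specialization written for a maximum operator would be unreachable through this classifier.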
622450c to 9df7a86
🟩 CI finished in 1h 07m: Pass: 100%/90 | Total: 23h 31m | Avg: 15m 40s | Max: 37m 35s | Hits: 89%/132225
| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 90)
# | Runner |
---|---|
65 | linux-amd64-cpu16 |
9 | windows-amd64-cpu16 |
6 | linux-amd64-gpu-rtxa6000-latest-1 |
4 | linux-arm64-cpu16 |
3 | linux-amd64-gpu-rtx4090-latest-1 |
2 | linux-amd64-gpu-rtx2080-latest-1 |
1 | linux-amd64-gpu-h100-latest-1 |
Backport failed. Please cherry-pick the changes locally and resolve any conflicts.

    git fetch origin branch/2.8.x
    git worktree add -d .worktree/backport-3559-to-branch/2.8.x origin/branch/2.8.x
    cd .worktree/backport-3559-to-branch/2.8.x
    git switch --create backport-3559-to-branch/2.8.x
    git cherry-pick -x 25523da2f942a045facfe2ec6839f448c60c2c4e
* Drop unused struct
* Refactor
* Clarify input type in scan benchmark
* Redesign scan policy selection after discussion with Georgii

Co-authored-by: Giannis Gonidelis
Co-authored-by: Georgii Evtushenko