Add b200 policies for partition.three_way #3708

Merged (1 commit) on Feb 6, 2025

bernhardmgruber
Contributor

@bernhardmgruber bernhardmgruber commented Feb 6, 2025

Pulled out the tunings which are already approved from #3617 to make progress.

Comment on lines 233 to 253
// template <class Input, class OffsetT>
// struct sm100_tuning<Input, OffsetT, input_size::_1, offset_size::_4>
// {
// // trp_0.ipt_12.tpb_256.ns_792.dcid_6.l2w_365 1.063960 0.978016 1.072833 1.301435
// static constexpr int items = 12;
// static constexpr int threads = 256;
// static constexpr BlockLoadAlgorithm load_algorithm = BLOCK_LOAD_DIRECT;
// using delay_constructor = exponential_backon_jitter_constructor_t<792, 365>;
// };

// template <class Input, class OffsetT>
// struct sm100_tuning<Input, OffsetT, input_size::_2, offset_size::_4>
// {
// // trp_1.ipt_14.tpb_288.ns_496.dcid_6.l2w_400 1.170449 1.123515 1.170428 1.252066
// static constexpr int items = 14;
// static constexpr int threads = 288;
// static constexpr BlockLoadAlgorithm load_algorithm = BLOCK_LOAD_WARP_TRANSPOSE;
// using delay_constructor = exponential_backon_jitter_constructor_t<496, 400>;
// };
Collaborator

Should we be explicit and state that these are the same as the SM90 tunings, as we do in other tunings?

Contributor Author

The more I think about it, the less I like constructs like:

// default back to SM90 tuning
template <....>
struct sm100_tuning<...> : sm90_tuning<...> {};

Tunings can work differently for each algorithm or architecture, so sm100_tuning can have different template arguments or different data members than sm90_tuning. Also, if sm100_tuning had a ::value, the selection logic in the policy hub could interpret it differently than sm90_tuning::value. Therefore, IMO, the best option is to not provide a template specialization at all, so that SFINAE does not find an sm100_tuning and the selection falls back.

But I could add a comment here.
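
For readers unfamiliar with the fallback described above, here is a minimal sketch of the idea. It is not the actual CUB policy hub: `agent_policy`, `select_policy`, the single `input_size` parameter, and all numeric values are hypothetical and chosen only for illustration. It assumes the primary tuning templates are empty, so a missing specialization simply has no `::threads` or `::items` to find.

```cpp
// Hypothetical stand-in for the real size tags used by the tunings.
enum class input_size { _1, _2 };

// Primary templates are empty: no members means "no tuning for this case".
template <input_size> struct sm90_tuning {};
template <input_size> struct sm100_tuning {};

// Illustrative values only, not real tunings.
template <> struct sm90_tuning<input_size::_1>
{
  static constexpr int items   = 11;
  static constexpr int threads = 256;
};
template <> struct sm90_tuning<input_size::_2>
{
  static constexpr int items   = 9;
  static constexpr int threads = 128;
};

// Only one SM100 specialization exists. input_size::_2 is deliberately
// left out instead of inheriting from sm90_tuning.
template <> struct sm100_tuning<input_size::_1>
{
  static constexpr int items   = 12;
  static constexpr int threads = 256;
};

// Hypothetical policy type built from a tuning.
template <int Threads, int Items>
struct agent_policy
{
  static constexpr int threads = Threads;
  static constexpr int items   = Items;
};

// Preferred overload: only viable if Tuning has ::threads and ::items.
// Otherwise substitution fails and the overload is silently discarded.
template <class Tuning, class Fallback>
auto select_policy(int) -> agent_policy<Tuning::threads, Tuning::items>;

// Fallback overload: chosen when the overload above is not viable.
template <class Tuning, class Fallback>
auto select_policy(long) -> Fallback;

// SM90 falls back to a hypothetical default, SM100 falls back to SM90.
template <input_size S>
using sm90_policy = decltype(select_policy<sm90_tuning<S>, agent_policy<128, 4>>(0));
template <input_size S>
using sm100_policy = decltype(select_policy<sm100_tuning<S>, sm90_policy<S>>(0));

// A specialization exists, so the SM100 values are used.
static_assert(sm100_policy<input_size::_1>::items == 12, "uses sm100 tuning");
// No specialization, so the SM90 policy is picked up unchanged.
static_assert(sm100_policy<input_size::_2>::items == 9, "falls back to sm90");
```

With this shape, sm100_tuning never needs to mirror the members or template parameters of sm90_tuning, which is the concern raised above about inheriting from it.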

@bernhardmgruber bernhardmgruber enabled auto-merge (squash) February 6, 2025 12:17
Contributor

github-actions bot commented Feb 6, 2025

🟩 CI finished in 1h 23m: Pass: 100%/90 | Total: 19h 18m | Avg: 12m 52s | Max: 39m 52s | Hits: 94%/132225
  • 🟩 cub: Pass: 100%/44 | Total: 12h 10m | Avg: 16m 36s | Max: 39m 52s | Hits: 92%/52320

    🟩 cpu
      🟩 amd64              Pass: 100%/42  | Total: 11h 40m | Avg: 16m 41s | Max: 39m 52s | Hits:  92%/49888 
      🟩 arm64              Pass: 100%/2   | Total: 29m 56s | Avg: 14m 58s | Max: 15m 17s | Hits:  99%/2432  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 25m | Avg: 17m 10s | Max: 32m 59s | Hits:  84%/5914  
      🟩 12.5               Pass: 100%/2   | Total: 41m 02s | Avg: 20m 31s | Max: 21m 43s | Hits:  95%/2250  
      🟩 12.8               Pass: 100%/37  | Total: 10h 03m | Avg: 16m 19s | Max: 39m 52s | Hits:  93%/44156 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 19m 41s | Avg:  9m 50s | Max: 10m 20s | Hits:  99%/2104  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 25m | Avg: 17m 10s | Max: 32m 59s | Hits:  84%/5914  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 41m 02s | Avg: 20m 31s | Max: 21m 43s | Hits:  95%/2250  
      🟩 nvcc12.8           Pass: 100%/35  | Total:  9h 44m | Avg: 16m 41s | Max: 39m 52s | Hits:  93%/42052 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 19m 41s | Avg:  9m 50s | Max: 10m 20s | Hits:  99%/2104  
      🟩 nvcc               Pass: 100%/42  | Total: 11h 51m | Avg: 16m 55s | Max: 39m 52s | Hits:  92%/50216 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 49m 53s | Avg: 12m 28s | Max: 13m 25s | Hits:  99%/4872  
      🟩 Clang15            Pass: 100%/2   | Total: 24m 56s | Avg: 12m 28s | Max: 12m 42s | Hits:  99%/2432  
      🟩 Clang16            Pass: 100%/2   | Total: 25m 29s | Avg: 12m 44s | Max: 13m 15s | Hits:  99%/2432  
      🟩 Clang17            Pass: 100%/2   | Total: 23m 52s | Avg: 11m 56s | Max: 11m 58s | Hits:  99%/2432  
      🟩 Clang18            Pass: 100%/7   | Total:  1h 45m | Avg: 15m 04s | Max: 25m 44s | Hits:  99%/8184  
      🟩 GCC7               Pass: 100%/2   | Total: 24m 46s | Avg: 12m 23s | Max: 12m 42s | Hits:  99%/2436  
      🟩 GCC8               Pass: 100%/1   | Total: 13m 24s | Avg: 13m 24s | Max: 13m 24s | Hits:  99%/1218  
      🟩 GCC9               Pass: 100%/2   | Total: 26m 57s | Avg: 13m 28s | Max: 14m 19s | Hits:  99%/2436  
      🟩 GCC10              Pass: 100%/2   | Total: 26m 13s | Avg: 13m 06s | Max: 13m 46s | Hits:  99%/2436  
      🟩 GCC11              Pass: 100%/2   | Total: 24m 32s | Avg: 12m 16s | Max: 12m 35s | Hits:  99%/2432  
      🟩 GCC12              Pass: 100%/2   | Total: 25m 06s | Avg: 12m 33s | Max: 12m 42s | Hits:  99%/2432  
      🟩 GCC13              Pass: 100%/10  | Total:  2h 51m | Avg: 17m 06s | Max: 26m 11s | Hits:  99%/12160 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 10m | Avg: 35m 09s | Max: 37m 20s | Hits:  15%/2084  
      🟩 MSVC14.42          Pass: 100%/2   | Total:  1h 17m | Avg: 38m 48s | Max: 39m 52s | Hits:  15%/2084  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 41m 02s | Avg: 20m 31s | Max: 21m 43s | Hits:  95%/2250  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  3h 49m | Avg: 13m 30s | Max: 25m 44s | Hits:  99%/20352 
      🟩 GCC                Pass: 100%/21  | Total:  5h 12m | Avg: 14m 51s | Max: 26m 11s | Hits:  99%/25550 
      🟩 MSVC               Pass: 100%/4   | Total:  2h 27m | Avg: 36m 58s | Max: 39m 52s | Hits:  15%/4168  
      🟩 NVHPC              Pass: 100%/2   | Total: 41m 02s | Avg: 20m 31s | Max: 21m 43s | Hits:  95%/2250  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 31m 26s | Avg: 15m 43s | Max: 26m 11s | Hits:  99%/2432  
      🟩 rtx2080            Pass: 100%/34  | Total:  9h 01m | Avg: 15m 55s | Max: 39m 52s | Hits:  90%/40160 
      🟩 rtxa6000           Pass: 100%/8   | Total:  2h 37m | Avg: 19m 42s | Max: 25m 44s | Hits:  99%/9728  
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total:  9h 32m | Avg: 15m 27s | Max: 39m 52s | Hits:  91%/43808 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 22m 05s | Avg: 22m 05s | Max: 22m 05s | Hits:  99%/1216  
      🟩 GraphCapture       Pass: 100%/1   | Total: 16m 04s | Avg: 16m 04s | Max: 16m 04s | Hits:  99%/1216  
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 17m | Avg: 25m 48s | Max: 26m 11s | Hits:  99%/3648  
      🟩 TestGPU            Pass: 100%/2   | Total: 43m 01s | Avg: 21m 30s | Max: 22m 34s | Hits:  99%/2432  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 31m 26s | Avg: 15m 43s | Max: 26m 11s | Hits:  99%/2432  
      🟩 90;90a;100         Pass: 100%/1   | Total: 13m 00s | Avg: 13m 00s | Max: 13m 00s | Hits:  98%/1216  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total:  5h 26m | Avg: 16m 19s | Max: 37m 44s | Hits:  88%/23559 
      🟩 20                 Pass: 100%/24  | Total:  6h 44m | Avg: 16m 50s | Max: 39m 52s | Hits:  96%/28761 
    
  • 🟩 thrust: Pass: 100%/43 | Total: 6h 30m | Avg: 9m 05s | Max: 30m 56s | Hits: 96%/79625

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 17m 50s | Avg:  8m 55s | Max: 11m 32s | Hits:  99%/3706  
    🟩 cpu
      🟩 amd64              Pass: 100%/41  | Total:  6h 20m | Avg:  9m 16s | Max: 30m 56s | Hits:  96%/75920 
      🟩 arm64              Pass: 100%/2   | Total: 10m 01s | Avg:  5m 00s | Max:  5m 15s | Hits:  99%/3705  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total: 46m 44s | Avg:  9m 20s | Max: 25m 59s | Hits:  93%/9256  
      🟩 12.5               Pass: 100%/2   | Total: 28m 52s | Avg: 14m 26s | Max: 15m 04s | Hits:  99%/3704  
      🟩 12.8               Pass: 100%/36  | Total:  5h 15m | Avg:  8m 45s | Max: 30m 56s | Hits:  96%/66665 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 40s | Avg:  5m 20s | Max:  5m 38s | Hits: 100%/3704  
      🟩 nvcc12.0           Pass: 100%/5   | Total: 46m 44s | Avg:  9m 20s | Max: 25m 59s | Hits:  93%/9256  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 28m 52s | Avg: 14m 26s | Max: 15m 04s | Hits:  99%/3704  
      🟩 nvcc12.8           Pass: 100%/34  | Total:  5h 04m | Avg:  8m 57s | Max: 30m 56s | Hits:  96%/62961 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 40s | Avg:  5m 20s | Max:  5m 38s | Hits: 100%/3704  
      🟩 nvcc               Pass: 100%/41  | Total:  6h 19m | Avg:  9m 16s | Max: 30m 56s | Hits:  96%/75921 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 20m 50s | Avg:  5m 12s | Max:  5m 33s | Hits: 100%/7408  
      🟩 Clang15            Pass: 100%/2   | Total: 11m 21s | Avg:  5m 40s | Max:  5m 47s | Hits: 100%/3704  
      🟩 Clang16            Pass: 100%/2   | Total: 12m 10s | Avg:  6m 05s | Max:  6m 07s | Hits: 100%/3704  
      🟩 Clang17            Pass: 100%/2   | Total: 11m 16s | Avg:  5m 38s | Max:  5m 46s | Hits: 100%/3704  
      🟩 Clang18            Pass: 100%/7   | Total: 46m 20s | Avg:  6m 37s | Max: 10m 50s | Hits: 100%/12964 
      🟩 GCC7               Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  5m 59s | Hits:  99%/3706  
      🟩 GCC8               Pass: 100%/1   | Total:  5m 26s | Avg:  5m 26s | Max:  5m 26s | Hits:  99%/1853  
      🟩 GCC9               Pass: 100%/2   | Total: 11m 03s | Avg:  5m 31s | Max:  5m 33s | Hits:  99%/3706  
      🟩 GCC10              Pass: 100%/2   | Total: 12m 10s | Avg:  6m 05s | Max:  6m 24s | Hits:  99%/3706  
      🟩 GCC11              Pass: 100%/2   | Total: 12m 19s | Avg:  6m 09s | Max:  6m 12s | Hits:  99%/3706  
      🟩 GCC12              Pass: 100%/2   | Total: 12m 24s | Avg:  6m 12s | Max:  6m 21s | Hits:  99%/3706  
      🟩 GCC13              Pass: 100%/8   | Total:  1h 02m | Avg:  7m 46s | Max: 11m 56s | Hits:  99%/14824 
      🟩 MSVC14.29          Pass: 100%/2   | Total: 49m 46s | Avg: 24m 53s | Max: 25m 59s | Hits:  69%/3692  
      🟩 MSVC14.42          Pass: 100%/3   | Total:  1h 23m | Avg: 27m 46s | Max: 30m 56s | Hits:  69%/5538  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 28m 52s | Avg: 14m 26s | Max: 15m 04s | Hits:  99%/3704  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  1h 41m | Avg:  5m 59s | Max: 10m 50s | Hits: 100%/31484 
      🟩 GCC                Pass: 100%/19  | Total:  2h 06m | Avg:  6m 40s | Max: 11m 56s | Hits:  99%/35207 
      🟩 MSVC               Pass: 100%/5   | Total:  2h 13m | Avg: 26m 37s | Max: 30m 56s | Hits:  69%/9230  
      🟩 NVHPC              Pass: 100%/2   | Total: 28m 52s | Avg: 14m 26s | Max: 15m 04s | Hits:  99%/3704  
    🟩 gpu
      🟩 rtx2080            Pass: 100%/33  | Total:  4h 21m | Avg:  7m 55s | Max: 25m 59s | Hits:  97%/61112 
      🟩 rtx4090            Pass: 100%/10  | Total:  2h 09m | Avg: 12m 54s | Max: 30m 56s | Hits:  93%/18513 
    🟩 jobs
      🟩 Build              Pass: 100%/37  | Total:  5h 08m | Avg:  8m 20s | Max: 28m 03s | Hits:  96%/68516 
      🟩 TestCPU            Pass: 100%/3   | Total: 47m 48s | Avg: 15m 56s | Max: 30m 56s | Hits:  89%/5551  
      🟩 TestGPU            Pass: 100%/3   | Total: 34m 18s | Avg: 11m 26s | Max: 11m 56s | Hits:  99%/5558  
    🟩 sm
      🟩 90;90a;100         Pass: 100%/1   | Total:  6m 20s | Avg:  6m 20s | Max:  6m 20s | Hits:  99%/1853  
    🟩 std
      🟩 17                 Pass: 100%/20  | Total:  3h 00m | Avg:  9m 00s | Max: 25m 59s | Hits:  95%/37031 
      🟩 20                 Pass: 100%/21  | Total:  3h 12m | Avg:  9m 10s | Max: 30m 56s | Hits:  97%/38888 
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 01s | Avg: 4m 00s | Max: 5m 39s | Hits: 98%/280

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 01s | Avg:  4m 00s | Max:  5m 39s | Hits:  98%/280   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total:  8m 01s | Avg:  4m 00s | Max:  5m 39s | Hits:  98%/280   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total:  8m 01s | Avg:  4m 00s | Max:  5m 39s | Hits:  98%/280   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  8m 01s | Avg:  4m 00s | Max:  5m 39s | Hits:  98%/280   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  8m 01s | Avg:  4m 00s | Max:  5m 39s | Hits:  98%/280   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  8m 01s | Avg:  4m 00s | Max:  5m 39s | Hits:  98%/280   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total:  8m 01s | Avg:  4m 00s | Max:  5m 39s | Hits:  98%/280   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 22s | Avg:  2m 22s | Max:  2m 22s | Hits:  98%/140   
      🟩 Test               Pass: 100%/1   | Total:  5m 39s | Avg:  5m 39s | Max:  5m 39s | Hits:  98%/140   
    
  • 🟩 python: Pass: 100%/1 | Total: 29m 04s | Avg: 29m 04s | Max: 29m 04s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 29m 04s | Avg: 29m 04s | Max: 29m 04s
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total: 29m 04s | Avg: 29m 04s | Max: 29m 04s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total: 29m 04s | Avg: 29m 04s | Max: 29m 04s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 29m 04s | Avg: 29m 04s | Max: 29m 04s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 29m 04s | Avg: 29m 04s | Max: 29m 04s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 29m 04s | Avg: 29m 04s | Max: 29m 04s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 29m 04s | Avg: 29m 04s | Max: 29m 04s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 29m 04s | Avg: 29m 04s | Max: 29m 04s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 90)

# Runner
65 linux-amd64-cpu16
9 windows-amd64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-arm64-cpu16
3 linux-amd64-gpu-rtx4090-latest-1
2 linux-amd64-gpu-rtx2080-latest-1
1 linux-amd64-gpu-h100-latest-1

@bernhardmgruber bernhardmgruber merged commit 9b7333b into NVIDIA:main Feb 6, 2025
102 of 105 checks passed
github-actions bot pushed a commit that referenced this pull request Feb 6, 2025
Contributor

github-actions bot commented Feb 6, 2025

Successfully created backport PR for branch/2.8.x:

miscco added a commit that referenced this pull request Feb 6, 2025
(cherry picked from commit 9b7333b)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>
@bernhardmgruber bernhardmgruber deleted the tune_partition_three branch February 6, 2025 15:51
Projects
Status: Done
3 participants