Backport to 2.8: B200 reduce tunings #3735

Merged

bernhardmgruber (Contributor)

bernhardmgruber and others added 2 commits February 7, 2025 09:32
* Add b200 policies for cub.device.reduce.sum

* Add b200 policies for reduce.min

---------

Co-authored-by: Giannis Gonidelis <[email protected]>

github-actions bot commented Feb 7, 2025

🟩 CI finished in 1h 18m: Pass: 100%/95 | Total: 2d 15h | Avg: 40m 04s | Max: 1h 11m | Hits: 283%/10540
  • 🟩 cub: Pass: 100%/47 | Total: 1d 14h | Avg: 49m 45s | Max: 1h 11m | Hits: 399%/3132

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total:  1d 13h | Avg: 49m 24s | Max:  1h 11m | Hits: 399%/3132  
      🟩 arm64              Pass: 100%/2   | Total:  1h 55m | Avg: 57m 50s | Max:  1h 00m
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  5h 55m | Avg: 50m 43s | Max:  1h 11m | Hits: 399%/783   
      🟩 12.5               Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 06m
      🟩 12.6               Pass: 100%/38  | Total:  1d 06h | Avg: 48m 42s | Max:  1h 09m | Hits: 400%/2349  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 03m
      🟩 nvcc11.1           Pass: 100%/7   | Total:  5h 55m | Avg: 50m 43s | Max:  1h 11m | Hits: 399%/783   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 06m
      🟩 nvcc12.6           Pass: 100%/36  | Total:  1d 04h | Avg: 48m 00s | Max:  1h 09m | Hits: 400%/2349  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 03m
      🟩 nvcc               Pass: 100%/45  | Total:  1d 12h | Avg: 49m 15s | Max:  1h 11m | Hits: 399%/3132  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  3h 26m | Avg: 51m 34s | Max:  1h 06m
      🟩 Clang10            Pass: 100%/1   | Total: 58m 33s | Avg: 58m 33s | Max: 58m 33s
      🟩 Clang11            Pass: 100%/1   | Total: 57m 05s | Avg: 57m 05s | Max: 57m 05s
      🟩 Clang12            Pass: 100%/1   | Total: 56m 21s | Avg: 56m 21s | Max: 56m 21s
      🟩 Clang13            Pass: 100%/1   | Total: 58m 15s | Avg: 58m 15s | Max: 58m 15s
      🟩 Clang14            Pass: 100%/1   | Total: 57m 25s | Avg: 57m 25s | Max: 57m 25s
      🟩 Clang15            Pass: 100%/1   | Total: 53m 40s | Avg: 53m 40s | Max: 53m 40s
      🟩 Clang16            Pass: 100%/1   | Total: 54m 51s | Avg: 54m 51s | Max: 54m 51s
      🟩 Clang17            Pass: 100%/1   | Total: 53m 27s | Avg: 53m 27s | Max: 53m 27s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 21m | Avg: 45m 57s | Max:  1h 03m
      🟩 GCC6               Pass: 100%/2   | Total:  1h 40m | Avg: 50m 23s | Max: 50m 42s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 47m | Avg: 53m 52s | Max: 56m 37s
      🟩 GCC8               Pass: 100%/1   | Total: 51m 48s | Avg: 51m 48s | Max: 51m 48s
      🟩 GCC9               Pass: 100%/3   | Total:  2h 31m | Avg: 50m 27s | Max: 57m 25s
      🟩 GCC10              Pass: 100%/1   | Total: 58m 54s | Avg: 58m 54s | Max: 58m 54s
      🟩 GCC11              Pass: 100%/1   | Total: 52m 38s | Avg: 52m 38s | Max: 52m 38s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 48m | Avg: 36m 04s | Max:  1h 01m
      🟩 GCC13              Pass: 100%/8   | Total:  4h 25m | Avg: 33m 08s | Max:  1h 00m
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 58m 38s | Avg: 58m 38s | Max: 58m 38s
      🟩 MSVC14.16          Pass: 100%/1   | Total:  1h 11m | Avg:  1h 11m | Max:  1h 11m | Hits: 399%/783   
      🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 06m | Avg:  1h 06m | Max:  1h 06m | Hits: 400%/783   
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 15m | Avg:  1h 07m | Max:  1h 09m | Hits: 399%/1566  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 06m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total: 16h 17m | Avg: 51m 27s | Max:  1h 06m
      🟩 GCC                Pass: 100%/21  | Total: 14h 56m | Avg: 42m 41s | Max:  1h 01m
      🟩 Intel              Pass: 100%/1   | Total: 58m 38s | Avg: 58m 38s | Max: 58m 38s
      🟩 MSVC               Pass: 100%/4   | Total:  4h 32m | Avg:  1h 08m | Max:  1h 11m | Hits: 399%/3132  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 06m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 47m 00s | Avg: 23m 30s | Max: 23m 51s
      🟩 rtxa6000           Pass: 100%/8   | Total:  3h 38m | Avg: 27m 20s | Max: 53m 54s
      🟩 v100               Pass: 100%/37  | Total:  1d 10h | Avg: 56m 01s | Max:  1h 11m | Hits: 399%/3132  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  1d 12h | Avg: 55m 04s | Max:  1h 11m | Hits: 399%/3132  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 15m 56s | Avg: 15m 56s | Max: 15m 56s
      🟩 GraphCapture       Pass: 100%/1   | Total: 13m 53s | Avg: 13m 53s | Max: 13m 53s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 05m | Avg: 21m 57s | Max: 23m 51s
      🟩 TestGPU            Pass: 100%/2   | Total: 40m 13s | Avg: 20m 06s | Max: 20m 35s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 47m 00s | Avg: 23m 30s | Max: 23m 51s
      🟩 90a                Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  4h 04m | Avg: 48m 53s | Max: 51m 07s
      🟩 14                 Pass: 100%/4   | Total:  4h 04m | Avg:  1h 01m | Max:  1h 11m | Hits: 399%/783   
      🟩 17                 Pass: 100%/12  | Total: 11h 26m | Avg: 57m 14s | Max:  1h 06m | Hits: 400%/1566  
      🟩 20                 Pass: 100%/26  | Total: 19h 23m | Avg: 44m 44s | Max:  1h 09m | Hits: 398%/783   
    
  • 🟩 thrust: Pass: 100%/45 | Total: 23h 54m | Avg: 31m 52s | Max: 1h 08m | Hits: 234%/7408

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 37m 41s | Avg: 18m 50s | Max: 27m 10s
    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total: 22h 56m | Avg: 32m 00s | Max:  1h 08m | Hits: 234%/7408  
      🟩 arm64              Pass: 100%/2   | Total: 58m 20s | Avg: 29m 10s | Max: 30m 58s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  3h 31m | Avg: 30m 11s | Max: 55m 35s | Hits: 236%/1852  
      🟩 12.5               Pass: 100%/2   | Total:  1h 36m | Avg: 48m 22s | Max: 48m 56s
      🟩 12.6               Pass: 100%/36  | Total: 18h 46m | Avg: 31m 17s | Max:  1h 08m | Hits: 233%/5556  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 51m 42s | Avg: 25m 51s | Max: 26m 10s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  3h 31m | Avg: 30m 11s | Max: 55m 35s | Hits: 236%/1852  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 36m | Avg: 48m 22s | Max: 48m 56s
      🟩 nvcc12.6           Pass: 100%/34  | Total: 17h 54m | Avg: 31m 36s | Max:  1h 08m | Hits: 233%/5556  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 51m 42s | Avg: 25m 51s | Max: 26m 10s
      🟩 nvcc               Pass: 100%/43  | Total: 23h 02m | Avg: 32m 09s | Max:  1h 08m | Hits: 234%/7408  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  1h 49m | Avg: 27m 20s | Max: 31m 30s
      🟩 Clang10            Pass: 100%/1   | Total: 34m 45s | Avg: 34m 45s | Max: 34m 45s
      🟩 Clang11            Pass: 100%/1   | Total: 30m 38s | Avg: 30m 38s | Max: 30m 38s
      🟩 Clang12            Pass: 100%/1   | Total: 33m 00s | Avg: 33m 00s | Max: 33m 00s
      🟩 Clang13            Pass: 100%/1   | Total: 29m 35s | Avg: 29m 35s | Max: 29m 35s
      🟩 Clang14            Pass: 100%/1   | Total: 31m 47s | Avg: 31m 47s | Max: 31m 47s
      🟩 Clang15            Pass: 100%/1   | Total: 33m 29s | Avg: 33m 29s | Max: 33m 29s
      🟩 Clang16            Pass: 100%/1   | Total: 33m 08s | Avg: 33m 08s | Max: 33m 08s
      🟩 Clang17            Pass: 100%/1   | Total: 29m 56s | Avg: 29m 56s | Max: 29m 56s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 39m | Avg: 22m 46s | Max: 32m 11s
      🟩 GCC6               Pass: 100%/2   | Total: 49m 31s | Avg: 24m 45s | Max: 27m 07s
      🟩 GCC7               Pass: 100%/2   | Total: 55m 54s | Avg: 27m 57s | Max: 31m 56s
      🟩 GCC8               Pass: 100%/1   | Total: 30m 19s | Avg: 30m 19s | Max: 30m 19s
      🟩 GCC9               Pass: 100%/3   | Total:  1h 25m | Avg: 28m 30s | Max: 30m 43s
      🟩 GCC10              Pass: 100%/1   | Total: 33m 49s | Avg: 33m 49s | Max: 33m 49s
      🟩 GCC11              Pass: 100%/1   | Total: 34m 40s | Avg: 34m 40s | Max: 34m 40s
      🟩 GCC12              Pass: 100%/1   | Total: 32m 09s | Avg: 32m 09s | Max: 32m 09s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 29m | Avg: 26m 08s | Max: 44m 55s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 39m 06s | Avg: 39m 06s | Max: 39m 06s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 55m 35s | Avg: 55m 35s | Max: 55m 35s | Hits: 236%/1852  
      🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 00m | Avg:  1h 00m | Max:  1h 00m | Hits: 240%/1852  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 08m | Hits: 229%/3704  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 36m | Avg: 48m 22s | Max: 48m 56s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  8h 45m | Avg: 27m 38s | Max: 34m 45s
      🟩 GCC                Pass: 100%/19  | Total:  8h 51m | Avg: 27m 56s | Max: 44m 55s
      🟩 Intel              Pass: 100%/1   | Total: 39m 06s | Avg: 39m 06s | Max: 39m 06s
      🟩 MSVC               Pass: 100%/4   | Total:  4h 02m | Avg:  1h 00m | Max:  1h 08m | Hits: 234%/7408  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 36m | Avg: 48m 22s | Max: 48m 56s
    🟩 gpu
      🟩 rtx4090            Pass: 100%/8   | Total:  2h 53m | Avg: 21m 37s | Max: 44m 55s
      🟩 v100               Pass: 100%/37  | Total: 21h 01m | Avg: 34m 05s | Max:  1h 08m | Hits: 234%/7408  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total: 22h 33m | Avg: 33m 49s | Max:  1h 08m | Hits: 234%/7408  
      🟩 TestCPU            Pass: 100%/2   | Total: 16m 04s | Avg:  8m 02s | Max:  8m 22s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 05m | Avg: 21m 50s | Max: 44m 55s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 20m 16s | Avg: 20m 16s | Max: 20m 16s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  1h 59m | Avg: 23m 48s | Max: 26m 42s
      🟩 14                 Pass: 100%/4   | Total:  2h 26m | Avg: 36m 32s | Max: 55m 35s | Hits: 236%/1852  
      🟩 17                 Pass: 100%/12  | Total:  7h 34m | Avg: 37m 50s | Max:  1h 00m | Hits: 235%/3704  
      🟩 20                 Pass: 100%/22  | Total: 11h 17m | Avg: 30m 48s | Max:  1h 08m | Hits: 229%/1852  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 7m 07s | Avg: 3m 33s | Max: 4m 51s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  7m 07s | Avg:  3m 33s | Max:  4m 51s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  7m 07s | Avg:  3m 33s | Max:  4m 51s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  7m 07s | Avg:  3m 33s | Max:  4m 51s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  7m 07s | Avg:  3m 33s | Max:  4m 51s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  7m 07s | Avg:  3m 33s | Max:  4m 51s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  7m 07s | Avg:  3m 33s | Max:  4m 51s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total:  7m 07s | Avg:  3m 33s | Max:  4m 51s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 16s | Avg:  2m 16s | Max:  2m 16s
      🟩 Test               Pass: 100%/1   | Total:  4m 51s | Avg:  4m 51s | Max:  4m 51s
    
  • 🟩 python: Pass: 100%/1 | Total: 26m 26s | Avg: 26m 26s | Max: 26m 26s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 26m 26s | Avg: 26m 26s | Max: 26m 26s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 26m 26s | Avg: 26m 26s | Max: 26m 26s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 26m 26s | Avg: 26m 26s | Max: 26m 26s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 26m 26s | Avg: 26m 26s | Max: 26m 26s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 26m 26s | Avg: 26m 26s | Max: 26m 26s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 26m 26s | Avg: 26m 26s | Max: 26m 26s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 26m 26s | Avg: 26m 26s | Max: 26m 26s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 26m 26s | Avg: 26m 26s | Max: 26m 26s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 95)

# Runner
71 linux-amd64-cpu16
8 windows-amd64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-arm64-cpu16
3 linux-amd64-gpu-rtx4090-latest-1
2 linux-amd64-gpu-rtx2080-latest-1
1 linux-amd64-gpu-h100-latest-1

bernhardmgruber merged commit 8977f79 into NVIDIA:branch/2.8.x Feb 7, 2025
112 checks passed
bernhardmgruber deleted the backport_reduce_tunings branch February 7, 2025 10:03
Labels
None yet
Projects
Archived in project
2 participants