Backport to 2.8: B200 tunings for histogram #3728

Merged
merged 2 commits into from
Feb 7, 2025

bernhardmgruber
@bernhardmgruber bernhardmgruber requested review from a team as code owners February 6, 2025 22:54
bernhardmgruber and others added 2 commits February 6, 2025 20:40
The tuning data member names did not match the ones used when selecting
tunings, so all SM100 tunings were SFINAE-ed out.

Also drop tunings with no benefit.
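
For readers unfamiliar with this failure mode, here is a minimal sketch of how a mismatched data member name can cause a tuning to be dropped silently by SFINAE. All names here (`select_items`, `items_per_thread`, `tuning_matching_name`, `tuning_mismatched_name`, `default_items_per_thread`) are hypothetical and chosen for illustration only; this is not CUB's actual tuning or selection code, and it assumes C++17 for `std::void_t`.

```cpp
#include <type_traits>

// Hypothetical names throughout; this is not CUB's actual tuning code.

// Fallback value used when no architecture-specific tuning is detected.
constexpr int default_items_per_thread = 8;

// A tuning whose data member has the name the selector looks for.
struct tuning_matching_name
{
  static constexpr int items_per_thread = 12;
};

// A tuning whose data member was given a different name by mistake.
struct tuning_mismatched_name
{
  static constexpr int items = 12;
};

// Primary template: used whenever T::items_per_thread is ill-formed.
template <class T, class = void>
struct select_items
{
  static constexpr int value = default_items_per_thread;
};

// Partial specialization: viable only when T::items_per_thread exists.
// If the member is missing, substitution fails and the compiler falls
// back to the primary template without any diagnostic.
template <class T>
struct select_items<T, std::void_t<decltype(T::items_per_thread)>>
{
  static constexpr int value = T::items_per_thread;
};

// The correctly named tuning is picked up.
static_assert(select_items<tuning_matching_name>::value == 12, "tuning should be selected");

// The misnamed tuning compiles fine but is ignored in favor of the default.
static_assert(select_items<tuning_mismatched_name>::value == default_items_per_thread,
              "tuning is silently dropped");
```

Because the fallback path is valid code, nothing fails to compile when the names diverge. A check like the `static_assert`s above is one possible way to catch such a mismatch early.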
@elstehle elstehle force-pushed the backport_tune_histo branch from 3a12ff7 to 698746b Compare February 7, 2025 04:41
@elstehle elstehle enabled auto-merge (squash) February 7, 2025 04:41

github-actions bot commented Feb 7, 2025

🟩 CI finished in 2h 40m: Pass: 100%/95 | Total: 16h 17m | Avg: 10m 17s | Max: 44m 13s | Hits: 434%/10540
  • 🟩 cub: Pass: 100%/47 | Total: 8h 36m | Avg: 10m 59s | Max: 44m 13s | Hits: 596%/3132

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total:  8h 24m | Avg: 11m 12s | Max: 44m 13s | Hits: 596%/3132  
      🟩 arm64              Pass: 100%/2   | Total: 12m 32s | Avg:  6m 16s | Max:  6m 34s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  1h 21m | Avg: 11m 36s | Max: 44m 13s | Hits: 595%/783   
      🟩 12.5               Pass: 100%/2   | Total: 22m 26s | Avg: 11m 13s | Max: 11m 20s
      🟩 12.6               Pass: 100%/38  | Total:  6h 52m | Avg: 10m 51s | Max: 32m 00s | Hits: 596%/2349  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 51s | Avg:  5m 25s | Max:  5m 33s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  1h 21m | Avg: 11m 36s | Max: 44m 13s | Hits: 595%/783   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 22m 26s | Avg: 11m 13s | Max: 11m 20s
      🟩 nvcc12.6           Pass: 100%/36  | Total:  6h 42m | Avg: 11m 10s | Max: 32m 00s | Hits: 596%/2349  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 51s | Avg:  5m 25s | Max:  5m 33s
      🟩 nvcc               Pass: 100%/45  | Total:  8h 25m | Avg: 11m 14s | Max: 44m 13s | Hits: 596%/3132  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 27m 46s | Avg:  6m 56s | Max:  7m 54s
      🟩 Clang10            Pass: 100%/1   | Total:  7m 55s | Avg:  7m 55s | Max:  7m 55s
      🟩 Clang11            Pass: 100%/1   | Total:  6m 02s | Avg:  6m 02s | Max:  6m 02s
      🟩 Clang12            Pass: 100%/1   | Total:  6m 46s | Avg:  6m 46s | Max:  6m 46s
      🟩 Clang13            Pass: 100%/1   | Total:  6m 28s | Avg:  6m 28s | Max:  6m 28s
      🟩 Clang14            Pass: 100%/1   | Total:  6m 25s | Avg:  6m 25s | Max:  6m 25s
      🟩 Clang15            Pass: 100%/1   | Total:  6m 20s | Avg:  6m 20s | Max:  6m 20s
      🟩 Clang16            Pass: 100%/1   | Total:  6m 32s | Avg:  6m 32s | Max:  6m 32s
      🟩 Clang17            Pass: 100%/1   | Total:  6m 59s | Avg:  6m 59s | Max:  6m 59s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 10m | Avg: 10m 03s | Max: 22m 41s
      🟩 GCC6               Pass: 100%/2   | Total: 13m 13s | Avg:  6m 36s | Max:  6m 59s
      🟩 GCC7               Pass: 100%/2   | Total: 14m 29s | Avg:  7m 14s | Max:  7m 39s
      🟩 GCC8               Pass: 100%/1   | Total:  6m 57s | Avg:  6m 57s | Max:  6m 57s
      🟩 GCC9               Pass: 100%/3   | Total: 18m 55s | Avg:  6m 18s | Max:  7m 02s
      🟩 GCC10              Pass: 100%/1   | Total:  7m 16s | Avg:  7m 16s | Max:  7m 16s
      🟩 GCC11              Pass: 100%/1   | Total:  7m 07s | Avg:  7m 07s | Max:  7m 07s
      🟩 GCC12              Pass: 100%/3   | Total: 34m 35s | Avg: 11m 31s | Max: 22m 26s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 37m | Avg: 12m 12s | Max: 20m 53s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total:  8m 28s | Avg:  8m 28s | Max:  8m 28s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 44m 13s | Avg: 44m 13s | Max: 44m 13s | Hits: 595%/783   
      🟩 MSVC14.29          Pass: 100%/1   | Total: 28m 36s | Avg: 28m 36s | Max: 28m 36s | Hits: 596%/783   
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 01m | Avg: 30m 30s | Max: 32m 00s | Hits: 596%/1566  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 22m 26s | Avg: 11m 13s | Max: 11m 20s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  2h 31m | Avg:  7m 58s | Max: 22m 41s
      🟩 GCC                Pass: 100%/21  | Total:  3h 20m | Avg:  9m 32s | Max: 22m 26s
      🟩 Intel              Pass: 100%/1   | Total:  8m 28s | Avg:  8m 28s | Max:  8m 28s
      🟩 MSVC               Pass: 100%/4   | Total:  2h 13m | Avg: 33m 27s | Max: 44m 13s | Hits: 596%/3132  
      🟩 NVHPC              Pass: 100%/2   | Total: 22m 26s | Avg: 11m 13s | Max: 11m 20s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 27m 42s | Avg: 13m 51s | Max: 22m 26s
      🟩 rtxa6000           Pass: 100%/8   | Total:  2h 06m | Avg: 15m 48s | Max: 22m 41s
      🟩 v100               Pass: 100%/37  | Total:  6h 02m | Avg:  9m 47s | Max: 44m 13s | Hits: 596%/3132  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  6h 21m | Avg:  9m 32s | Max: 44m 13s | Hits: 596%/3132  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 15m 48s | Avg: 15m 48s | Max: 15m 48s
      🟩 GraphCapture       Pass: 100%/1   | Total: 14m 24s | Avg: 14m 24s | Max: 14m 24s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 05m | Avg: 21m 59s | Max: 22m 41s
      🟩 TestGPU            Pass: 100%/2   | Total: 38m 57s | Avg: 19m 28s | Max: 20m 53s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 27m 42s | Avg: 13m 51s | Max: 22m 26s
      🟩 90a                Pass: 100%/1   | Total:  5m 05s | Avg:  5m 05s | Max:  5m 05s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total: 33m 30s | Avg:  6m 42s | Max:  7m 54s
      🟩 14                 Pass: 100%/4   | Total:  1h 06m | Avg: 16m 41s | Max: 44m 13s | Hits: 595%/783   
      🟩 17                 Pass: 100%/12  | Total:  2h 09m | Avg: 10m 47s | Max: 29m 01s | Hits: 596%/1566  
      🟩 20                 Pass: 100%/26  | Total:  4h 46m | Avg: 11m 02s | Max: 32m 00s | Hits: 596%/783   
    
  • 🟩 thrust: Pass: 100%/45 | Total: 7h 07m | Avg: 9m 29s | Max: 31m 16s | Hits: 366%/7408

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 16m 11s | Avg:  8m 05s | Max: 10m 34s
    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total:  6h 31m | Avg:  9m 06s | Max: 31m 00s | Hits: 366%/7408  
      🟩 arm64              Pass: 100%/2   | Total: 35m 48s | Avg: 17m 54s | Max: 31m 16s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  1h 17m | Avg: 11m 04s | Max: 30m 52s | Hits: 368%/1852  
      🟩 12.5               Pass: 100%/2   | Total: 27m 40s | Avg: 13m 50s | Max: 14m 11s
      🟩 12.6               Pass: 100%/36  | Total:  5h 22m | Avg:  8m 56s | Max: 31m 16s | Hits: 365%/5556  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 32s | Avg:  5m 16s | Max:  5m 16s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  1h 17m | Avg: 11m 04s | Max: 30m 52s | Hits: 368%/1852  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 27m 40s | Avg: 13m 50s | Max: 14m 11s
      🟩 nvcc12.6           Pass: 100%/34  | Total:  5h 11m | Avg:  9m 09s | Max: 31m 16s | Hits: 365%/5556  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 32s | Avg:  5m 16s | Max:  5m 16s
      🟩 nvcc               Pass: 100%/43  | Total:  6h 56m | Avg:  9m 41s | Max: 31m 16s | Hits: 366%/7408  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 20m 57s | Avg:  5m 14s | Max:  6m 24s
      🟩 Clang10            Pass: 100%/1   | Total:  6m 50s | Avg:  6m 50s | Max:  6m 50s
      🟩 Clang11            Pass: 100%/1   | Total:  5m 13s | Avg:  5m 13s | Max:  5m 13s
      🟩 Clang12            Pass: 100%/1   | Total:  5m 14s | Avg:  5m 14s | Max:  5m 14s
      🟩 Clang13            Pass: 100%/1   | Total:  5m 09s | Avg:  5m 09s | Max:  5m 09s
      🟩 Clang14            Pass: 100%/1   | Total:  5m 15s | Avg:  5m 15s | Max:  5m 15s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 43s | Avg:  5m 43s | Max:  5m 43s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 37s | Avg:  5m 37s | Max:  5m 37s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 23s | Avg:  5m 23s | Max:  5m 23s
      🟩 Clang18            Pass: 100%/7   | Total: 43m 56s | Avg:  6m 16s | Max: 10m 08s
      🟩 GCC6               Pass: 100%/2   | Total: 34m 40s | Avg: 17m 20s | Max: 30m 52s
      🟩 GCC7               Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  5m 27s
      🟩 GCC8               Pass: 100%/1   | Total:  9m 44s | Avg:  9m 44s | Max:  9m 44s
      🟩 GCC9               Pass: 100%/3   | Total: 14m 43s | Avg:  4m 54s | Max:  5m 40s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 16s | Avg:  5m 16s | Max:  5m 16s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 53s | Avg:  5m 53s | Max:  5m 53s
      🟩 GCC12              Pass: 100%/1   | Total:  5m 59s | Avg:  5m 59s | Max:  5m 59s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 23m | Avg: 10m 28s | Max: 31m 16s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total:  6m 59s | Avg:  6m 59s | Max:  6m 59s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 24m 44s | Avg: 24m 44s | Max: 24m 44s | Hits: 368%/1852  
      🟩 MSVC14.29          Pass: 100%/1   | Total: 26m 40s | Avg: 26m 40s | Max: 26m 40s | Hits: 365%/1852  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 01m | Avg: 30m 55s | Max: 31m 00s | Hits: 365%/3704  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 27m 40s | Avg: 13m 50s | Max: 14m 11s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  1h 49m | Avg:  5m 45s | Max: 10m 08s
      🟩 GCC                Pass: 100%/19  | Total:  2h 50m | Avg:  8m 57s | Max: 31m 16s
      🟩 Intel              Pass: 100%/1   | Total:  6m 59s | Avg:  6m 59s | Max:  6m 59s
      🟩 MSVC               Pass: 100%/4   | Total:  1h 53m | Avg: 28m 18s | Max: 31m 00s | Hits: 366%/7408  
      🟩 NVHPC              Pass: 100%/2   | Total: 27m 40s | Avg: 13m 50s | Max: 14m 11s
    🟩 gpu
      🟩 rtx4090            Pass: 100%/8   | Total:  1h 03m | Avg:  7m 52s | Max: 10m 43s
      🟩 v100               Pass: 100%/37  | Total:  6h 04m | Avg:  9m 50s | Max: 31m 16s | Hits: 366%/7408  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  6h 20m | Avg:  9m 31s | Max: 31m 16s | Hits: 366%/7408  
      🟩 TestCPU            Pass: 100%/2   | Total: 15m 10s | Avg:  7m 35s | Max:  7m 50s
      🟩 TestGPU            Pass: 100%/3   | Total: 31m 25s | Avg: 10m 28s | Max: 10m 43s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 42s | Avg:  4m 42s | Max:  4m 42s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total: 22m 10s | Avg:  4m 26s | Max:  5m 31s
      🟩 14                 Pass: 100%/4   | Total:  1h 07m | Avg: 16m 51s | Max: 30m 52s | Hits: 368%/1852  
      🟩 17                 Pass: 100%/12  | Total:  2h 09m | Avg: 10m 45s | Max: 30m 50s | Hits: 365%/3704  
      🟩 20                 Pass: 100%/22  | Total:  3h 12m | Avg:  8m 45s | Max: 31m 16s | Hits: 365%/1852  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 7m 11s | Avg: 3m 35s | Max: 5m 06s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  5m 06s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  5m 06s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  5m 06s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  5m 06s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  5m 06s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  5m 06s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total:  7m 11s | Avg:  3m 35s | Max:  5m 06s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 05s | Avg:  2m 05s | Max:  2m 05s
      🟩 Test               Pass: 100%/1   | Total:  5m 06s | Avg:  5m 06s | Max:  5m 06s
    
  • 🟩 python: Pass: 100%/1 | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 26m 09s | Avg: 26m 09s | Max: 26m 09s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 95)

# Runner
71 linux-amd64-cpu16
8 windows-amd64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-arm64-cpu16
3 linux-amd64-gpu-rtx4090-latest-1
2 linux-amd64-gpu-rtx2080-latest-1
1 linux-amd64-gpu-h100-latest-1

@elstehle elstehle merged commit 5571cd8 into NVIDIA:branch/2.8.x Feb 7, 2025
111 checks passed
@bernhardmgruber bernhardmgruber deleted the backport_tune_histo branch February 7, 2025 08:38
Labels
None yet
Projects
Status: Done
2 participants