
Add cuda::is_floating_point supporting half and bfloat #3379

Merged
merged 1 commit into from
Jan 21, 2025


bernhardmgruber
Contributor

bernhardmgruber commented Jan 14, 2025

This PR adds a new header, cuda/type_traits, which provides a new trait, cuda::is_floating_point<T>. The trait defaults to cuda::std::is_floating_point<T> but additionally provides specializations for half and bfloat.
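The trait described above can be sketched in plain C++14 roughly as follows. This is a hypothetical illustration, not the libcu++ implementation: `std::is_floating_point` stands in for `cuda::std::is_floating_point`, the namespace `sketch` stands in for `cuda`, and `my_half` / `my_bfloat16` are made-up placeholders for the CUDA half and bfloat types. The sketch also treats cv-qualified versions of those types as floating-point, which the PR description does not specify.

```cpp
#include <type_traits>

// Hypothetical stand-ins for CUDA's half and bfloat types, so the sketch
// builds without the CUDA headers. Like the real types, they are class types,
// so std::is_floating_point reports false for them.
struct my_half { unsigned short bits; };
struct my_bfloat16 { unsigned short bits; };

namespace sketch {
namespace detail {
// Opt-in list of extra floating-point-like types (cv-unqualified).
template <class T> struct is_extra_fp : std::false_type {};
template <> struct is_extra_fp<my_half> : std::true_type {};
template <> struct is_extra_fp<my_bfloat16> : std::true_type {};
}  // namespace detail

// Defaults to the standard trait and additionally accepts the extra types,
// ignoring top-level cv-qualifiers like the standard trait does.
template <class T>
struct is_floating_point
    : std::integral_constant<bool,
          std::is_floating_point<T>::value ||
          detail::is_extra_fp<typename std::remove_cv<T>::type>::value> {};

template <class T>
constexpr bool is_floating_point_v = is_floating_point<T>::value;
}  // namespace sketch

// The extension sees the new types; the standard trait is left untouched.
static_assert(sketch::is_floating_point_v<float>, "standard types still count");
static_assert(sketch::is_floating_point_v<const my_half>, "half-like type counts");
static_assert(!sketch::is_floating_point_v<int>, "integers do not count");
static_assert(!std::is_floating_point<my_bfloat16>::value, "std trait unchanged");
```

Keeping the extra specializations outside the standard namespace is what lets the standard trait keep its conforming answer (`false` for class types), which is the constraint raised in the review discussion.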

@davebayer
Contributor

It may be worth going through the code and replacing all uses of __is_extended_floating_point_v<T> in is_floating_point_v<T> && __is_extended_floating_point_v<T>.

Collaborator

miscco left a comment

Unfortunately, that is not something we intend to do yet.

We are quite strict about what we allow in cuda::std and about it being strictly standard-compliant.

Changing that type trait would definitely break standard behavior, so we can only enable this as part of adding support for C++26 extended floating-point types.

@wmaxey
Member

wmaxey commented Jan 15, 2025

> Unfortunately, that is not something we intend to do yet.
>
> We are quite strict about what we allow in cuda::std and about it being strictly standard-compliant.
>
> Changing that type trait would definitely break standard behavior, so we can only enable this as part of adding support for C++26 extended floating-point types.

We can expose it as cuda:: pending standardization.

@miscco
Collaborator

miscco commented Jan 16, 2025

> We can expose it as cuda:: pending standardization.

I mean, they are standardized, just under different names such as float16_t.
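For comparison, the standardized names come from `<stdfloat>`, where `std::float16_t` and `std::bfloat16_t` are extended floating-point types and therefore already satisfy `std::is_floating_point`. Both aliases are optional, so this sketch only checks them when the compiler predefines `__STDCPP_FLOAT16_T__` / `__STDCPP_BFLOAT16_T__`; it assumes a standard library that ships `<stdfloat>` alongside those macros (for example, GCC 13 or later with libstdc++ in `-std=c++23` mode).

```cpp
#include <type_traits>

// Pull in <stdfloat> only where the header exists; older toolchains skip it.
#if defined(__has_include)
#if __has_include(<stdfloat>)
#include <stdfloat>
#endif
#endif

// These macros are predefined only when the implementation provides the
// corresponding optional extended floating-point type.
#if defined(__STDCPP_FLOAT16_T__)
static_assert(std::is_floating_point<std::float16_t>::value,
              "std::float16_t is an extended floating-point type");
#endif

#if defined(__STDCPP_BFLOAT16_T__)
static_assert(std::is_floating_point<std::bfloat16_t>::value,
              "std::bfloat16_t is an extended floating-point type");
#endif
```

Because these types are fundamental types rather than library classes, the standard trait can report them without any user specialization, unlike the existing half and bfloat class types.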

@jrhemstad
Collaborator

I agree; I think we should add this as a cuda:: extension. The existing library types are going to exist indefinitely, so we should still provide this machinery, just not as part of cuda::std::.

bernhardmgruber changed the title Specialize is_floating_point for half and bfloat to Add cuda::is_floating_point supporting half and bfloat Jan 21, 2025
bernhardmgruber force-pushed the half_limits branch 2 times, most recently from 3f0866f to a3d5e27 on January 21, 2025 10:54
Contributor

🟨 CI finished in 1h 57m: Pass: 99%/144 | Total: 1d 23h | Avg: 19m 40s | Max: 1h 04m | Hits: 509%/25812
  • 🟨 libcudacxx: Pass: 97%/46 | Total: 8h 38m | Avg: 11m 16s | Max: 35m 12s | Hits: 682%/12570

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  97%/44  | Total:  8h 28m | Avg: 11m 33s | Max: 35m 12s | Hits: 682%/12570 
      🟩 arm64              Pass: 100%/2   | Total:  9m 56s | Avg:  4m 58s | Max:  6m 40s
    🔍 ctk: 12.6 🔍
      🟩 12.0               Pass: 100%/8   | Total:  1h 13m | Avg:  9m 12s | Max: 22m 45s | Hits: 682%/4906  
      🟩 12.5               Pass: 100%/2   | Total: 47m 04s | Avg: 23m 32s | Max: 35m 12s
      🔍 12.6               Pass:  97%/36  | Total:  6h 38m | Avg: 11m 03s | Max: 28m 23s | Hits: 682%/7664  
    🔍 cudacxx: nvcc12.6 🔍
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  1h 05m | Avg: 16m 17s | Max: 21m 16s
      🟩 nvcc12.0           Pass: 100%/8   | Total:  1h 13m | Avg:  9m 12s | Max: 22m 45s | Hits: 682%/4906  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 47m 04s | Avg: 23m 32s | Max: 35m 12s
      🔍 nvcc12.6           Pass:  96%/32  | Total:  5h 32m | Avg: 10m 24s | Max: 28m 23s | Hits: 682%/7664  
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/4   | Total:  1h 05m | Avg: 16m 17s | Max: 21m 16s
      🔍 nvcc               Pass:  97%/42  | Total:  7h 33m | Avg: 10m 47s | Max: 35m 12s | Hits: 682%/12570 
    🔍 cxx: GCC13 🔍
      🟩 Clang14            Pass: 100%/6   | Total: 52m 32s | Avg:  8m 45s | Max: 12m 39s
      🟩 Clang15            Pass: 100%/1   | Total:  8m 06s | Avg:  8m 06s | Max:  8m 06s
      🟩 Clang16            Pass: 100%/1   | Total:  4m 11s | Avg:  4m 11s | Max:  4m 11s
      🟩 Clang17            Pass: 100%/1   | Total:  8m 25s | Avg:  8m 25s | Max:  8m 25s
      🟩 Clang18            Pass: 100%/8   | Total:  1h 39m | Avg: 12m 28s | Max: 21m 16s
      🟩 GCC7               Pass: 100%/5   | Total: 16m 46s | Avg:  3m 21s | Max:  3m 49s
      🟩 GCC8               Pass: 100%/1   | Total:  3m 44s | Avg:  3m 44s | Max:  3m 44s
      🟩 GCC9               Pass: 100%/3   | Total:  9m 40s | Avg:  3m 13s | Max:  3m 39s
      🟩 GCC10              Pass: 100%/1   | Total:  4m 03s | Avg:  4m 03s | Max:  4m 03s
      🟩 GCC11              Pass: 100%/1   | Total:  3m 35s | Avg:  3m 35s | Max:  3m 35s
      🟩 GCC12              Pass: 100%/1   | Total:  3m 56s | Avg:  3m 56s | Max:  3m 56s
      🔍 GCC13              Pass:  90%/10  | Total:  2h 16m | Avg: 13m 37s | Max: 28m 23s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  1h 08m | Avg: 22m 40s | Max: 24m 04s | Hits: 682%/7410  
      🟩 MSVC14.39          Pass: 100%/2   | Total: 52m 42s | Avg: 26m 21s | Max: 28m 22s | Hits: 682%/5160  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 47m 04s | Avg: 23m 32s | Max: 35m 12s
    🔍 cxx_family: GCC 🔍
      🟩 Clang              Pass: 100%/17  | Total:  2h 53m | Avg: 10m 10s | Max: 21m 16s
      🔍 GCC                Pass:  95%/22  | Total:  2h 57m | Avg:  8m 05s | Max: 28m 23s
      🟩 MSVC               Pass: 100%/5   | Total:  2h 00m | Avg: 24m 08s | Max: 28m 22s | Hits: 682%/12570 
      🟩 NVHPC              Pass: 100%/2   | Total: 47m 04s | Avg: 23m 32s | Max: 35m 12s
    🔍 jobs: Test 🔍
      🟩 Build              Pass: 100%/39  | Total:  6h 20m | Avg:  9m 45s | Max: 35m 12s | Hits: 682%/12570 
      🟩 NVRTC              Pass: 100%/4   | Total:  1h 43m | Avg: 25m 46s | Max: 28m 23s
      🔍 Test               Pass:  50%/2   | Total: 32m 54s | Avg: 16m 27s | Max: 16m 32s
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  2m 01s | Avg:  2m 01s | Max:  2m 01s
    🔍 std: 20 🔍
      🟩 11                 Pass: 100%/6   | Total: 56m 26s | Avg:  9m 24s | Max: 23m 04s
      🟩 14                 Pass: 100%/4   | Total: 58m 39s | Avg: 14m 39s | Max: 27m 21s | Hits: 682%/2412  
      🟩 17                 Pass: 100%/14  | Total:  2h 45m | Avg: 11m 48s | Max: 28m 23s | Hits: 682%/7502  
      🔍 20                 Pass:  95%/21  | Total:  3h 56m | Avg: 11m 15s | Max: 35m 12s | Hits: 681%/2656  
    🟨 gpu
      🟨 v100               Pass:  97%/46  | Total:  8h 38m | Avg: 11m 16s | Max: 35m 12s | Hits: 682%/12570 
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total: 13m 48s | Avg: 13m 48s | Max: 13m 48s
      🟩 90a                Pass: 100%/2   | Total: 16m 15s | Avg:  8m 07s | Max: 12m 29s
    
  • 🟩 cub: Pass: 100%/38 | Total: 1d 00h | Avg: 39m 04s | Max: 1h 04m | Hits: 433%/3540

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total: 23h 01m | Avg: 38m 21s | Max:  1h 04m | Hits: 433%/3540  
      🟩 arm64              Pass: 100%/2   | Total:  1h 43m | Avg: 51m 51s | Max: 55m 51s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 27m | Avg: 41m 29s | Max: 58m 08s | Hits: 443%/885   
      🟩 12.5               Pass: 100%/2   | Total:  1h 29m | Avg: 44m 57s | Max: 45m 40s
      🟩 12.6               Pass: 100%/31  | Total: 19h 47m | Avg: 38m 18s | Max:  1h 04m | Hits: 430%/2655  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 43m | Avg: 51m 55s | Max: 53m 45s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 27m | Avg: 41m 29s | Max: 58m 08s | Hits: 443%/885   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 29m | Avg: 44m 57s | Max: 45m 40s
      🟩 nvcc12.6           Pass: 100%/29  | Total: 18h 03m | Avg: 37m 21s | Max:  1h 04m | Hits: 430%/2655  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 43m | Avg: 51m 55s | Max: 53m 45s
      🟩 nvcc               Pass: 100%/36  | Total: 23h 00m | Avg: 38m 21s | Max:  1h 04m | Hits: 433%/3540  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 36m | Avg: 39m 08s | Max: 41m 15s
      🟩 Clang15            Pass: 100%/1   | Total: 38m 22s | Avg: 38m 22s | Max: 38m 22s
      🟩 Clang16            Pass: 100%/1   | Total: 36m 47s | Avg: 36m 47s | Max: 36m 47s
      🟩 Clang17            Pass: 100%/1   | Total: 37m 18s | Avg: 37m 18s | Max: 37m 18s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 05m | Avg: 43m 42s | Max: 53m 45s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 15m | Avg: 37m 48s | Max: 38m 24s
      🟩 GCC8               Pass: 100%/1   | Total: 51m 47s | Avg: 51m 47s | Max: 51m 47s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 14m | Avg: 37m 07s | Max: 37m 28s
      🟩 GCC10              Pass: 100%/1   | Total: 39m 08s | Avg: 39m 08s | Max: 39m 08s
      🟩 GCC11              Pass: 100%/1   | Total: 37m 09s | Avg: 37m 09s | Max: 37m 09s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 02m | Avg: 20m 45s | Max: 38m 29s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 53m | Avg: 29m 14s | Max: 55m 51s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 04m | Hits: 439%/1770  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 03m | Hits: 427%/1770  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 29m | Avg: 44m 57s | Max: 45m 40s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  9h 34m | Avg: 41m 04s | Max: 53m 45s
      🟩 GCC                Pass: 100%/18  | Total:  9h 34m | Avg: 31m 53s | Max: 55m 51s
      🟩 MSVC               Pass: 100%/4   | Total:  4h 05m | Avg:  1h 01m | Max:  1h 04m | Hits: 433%/3540  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 29m | Avg: 44m 57s | Max: 45m 40s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 23m 47s | Avg: 11m 53s | Max: 19m 23s
      🟩 v100               Pass: 100%/36  | Total:  1d 00h | Avg: 40m 35s | Max:  1h 04m | Hits: 433%/3540  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total: 21h 37m | Avg: 41m 50s | Max:  1h 04m | Hits: 433%/3540  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 24m 28s | Avg: 24m 28s | Max: 24m 28s
      🟩 GraphCapture       Pass: 100%/1   | Total: 20m 16s | Avg: 20m 16s | Max: 20m 16s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 05m | Avg: 21m 58s | Max: 25m 32s
      🟩 TestGPU            Pass: 100%/2   | Total:  1h 17m | Avg: 38m 32s | Max: 48m 35s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 23m 47s | Avg: 11m 53s | Max: 19m 23s
      🟩 90a                Pass: 100%/1   | Total:  4m 16s | Avg:  4m 16s | Max:  4m 16s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total: 10h 40m | Avg: 45m 43s | Max:  1h 04m | Hits: 435%/2655  
      🟩 20                 Pass: 100%/24  | Total: 14h 04m | Avg: 35m 11s | Max: 59m 50s | Hits: 427%/885   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 10h 45m | Avg: 17m 26s | Max: 55m 46s | Hits: 308%/9180

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 25m 27s | Avg: 12m 43s | Max: 19m 44s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total: 10h 30m | Avg: 18m 01s | Max: 55m 46s | Hits: 308%/9180  
      🟩 arm64              Pass: 100%/2   | Total: 14m 15s | Avg:  7m 07s | Max:  9m 40s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 50m | Avg: 22m 10s | Max: 44m 18s | Hits: 298%/1836  
      🟩 12.5               Pass: 100%/2   | Total:  1h 10m | Avg: 35m 27s | Max: 37m 55s
      🟩 12.6               Pass: 100%/30  | Total:  7h 43m | Avg: 15m 26s | Max: 55m 46s | Hits: 310%/7344  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 14m 46s | Avg:  7m 23s | Max:  9m 32s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 50m | Avg: 22m 10s | Max: 44m 18s | Hits: 298%/1836  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 10m | Avg: 35m 27s | Max: 37m 55s
      🟩 nvcc12.6           Pass: 100%/28  | Total:  7h 28m | Avg: 16m 01s | Max: 55m 46s | Hits: 310%/7344  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 14m 46s | Avg:  7m 23s | Max:  9m 32s
      🟩 nvcc               Pass: 100%/35  | Total: 10h 30m | Avg: 18m 00s | Max: 55m 46s | Hits: 308%/9180  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 41m 37s | Avg: 10m 24s | Max: 12m 17s
      🟩 Clang15            Pass: 100%/1   | Total: 10m 55s | Avg: 10m 55s | Max: 10m 55s
      🟩 Clang16            Pass: 100%/1   | Total:  8m 58s | Avg:  8m 58s | Max:  8m 58s
      🟩 Clang17            Pass: 100%/1   | Total: 11m 14s | Avg: 11m 14s | Max: 11m 14s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 03m | Avg:  9m 00s | Max: 21m 20s
      🟩 GCC7               Pass: 100%/2   | Total: 21m 12s | Avg: 10m 36s | Max: 11m 20s
      🟩 GCC8               Pass: 100%/1   | Total: 11m 48s | Avg: 11m 48s | Max: 11m 48s
      🟩 GCC9               Pass: 100%/2   | Total: 44m 46s | Avg: 22m 23s | Max: 35m 14s
      🟩 GCC10              Pass: 100%/1   | Total: 13m 15s | Avg: 13m 15s | Max: 13m 15s
      🟩 GCC11              Pass: 100%/1   | Total:  9m 48s | Avg:  9m 48s | Max:  9m 48s
      🟩 GCC12              Pass: 100%/1   | Total: 12m 45s | Avg: 12m 45s | Max: 12m 45s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 26m | Avg: 10m 52s | Max: 19m 44s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 35m | Avg: 47m 30s | Max: 50m 42s | Hits: 295%/3672  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 22m | Avg: 47m 39s | Max: 55m 46s | Hits: 316%/5508  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 10m | Avg: 35m 27s | Max: 37m 55s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  2h 15m | Avg:  9m 41s | Max: 21m 20s
      🟩 GCC                Pass: 100%/16  | Total:  3h 20m | Avg: 12m 32s | Max: 35m 14s
      🟩 MSVC               Pass: 100%/5   | Total:  3h 57m | Avg: 47m 35s | Max: 55m 46s | Hits: 308%/9180  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 10m | Avg: 35m 27s | Max: 37m 55s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total: 10h 45m | Avg: 17m 26s | Max: 55m 46s | Hits: 308%/9180  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  8h 56m | Avg: 17m 19s | Max: 55m 46s | Hits: 294%/7344  
      🟩 TestCPU            Pass: 100%/3   | Total: 51m 53s | Avg: 17m 17s | Max: 36m 00s | Hits: 365%/1836  
      🟩 TestGPU            Pass: 100%/3   | Total: 56m 17s | Avg: 18m 45s | Max: 21m 20s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 18s | Avg:  4m 18s | Max:  4m 18s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  4h 58m | Avg: 21m 21s | Max: 51m 11s | Hits: 294%/5508  
      🟩 20                 Pass: 100%/21  | Total:  5h 20m | Avg: 15m 16s | Max: 55m 46s | Hits: 328%/3672  
    
  • 🟩 cudax: Pass: 100%/20 | Total: 1h 53m | Avg: 5m 41s | Max: 19m 09s | Hits: 388%/522

    🟩 cpu
      🟩 amd64              Pass: 100%/16  | Total:  1h 43m | Avg:  6m 27s | Max: 19m 09s | Hits: 388%/522   
      🟩 arm64              Pass: 100%/4   | Total: 10m 35s | Avg:  2m 38s | Max:  2m 47s
    🟩 ctk
      🟩 12.0               Pass: 100%/1   | Total: 11m 08s | Avg: 11m 08s | Max: 11m 08s | Hits: 388%/261   
      🟩 12.5               Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  5m 27s
      🟩 12.6               Pass: 100%/17  | Total:  1h 32m | Avg:  5m 24s | Max: 19m 09s | Hits: 388%/261   
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/1   | Total: 11m 08s | Avg: 11m 08s | Max: 11m 08s | Hits: 388%/261   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  5m 27s
      🟩 nvcc12.6           Pass: 100%/17  | Total:  1h 32m | Avg:  5m 24s | Max: 19m 09s | Hits: 388%/261   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/20  | Total:  1h 53m | Avg:  5m 41s | Max: 19m 09s | Hits: 388%/522   
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  3m 03s | Avg:  3m 03s | Max:  3m 03s
      🟩 Clang15            Pass: 100%/1   | Total:  3m 30s | Avg:  3m 30s | Max:  3m 30s
      🟩 Clang16            Pass: 100%/1   | Total:  3m 12s | Avg:  3m 12s | Max:  3m 12s
      🟩 Clang17            Pass: 100%/1   | Total:  3m 23s | Avg:  3m 23s | Max:  3m 23s
      🟩 Clang18            Pass: 100%/4   | Total: 27m 12s | Avg:  6m 48s | Max: 18m 42s
      🟩 GCC10              Pass: 100%/1   | Total:  3m 02s | Avg:  3m 02s | Max:  3m 02s
      🟩 GCC11              Pass: 100%/1   | Total:  2m 55s | Avg:  2m 55s | Max:  2m 55s
      🟩 GCC12              Pass: 100%/2   | Total: 22m 21s | Avg: 11m 10s | Max: 19m 09s
      🟩 GCC13              Pass: 100%/4   | Total: 10m 32s | Avg:  2m 38s | Max:  2m 46s
      🟩 MSVC14.36          Pass: 100%/1   | Total: 11m 08s | Avg: 11m 08s | Max: 11m 08s | Hits: 388%/261   
      🟩 MSVC14.39          Pass: 100%/1   | Total: 12m 50s | Avg: 12m 50s | Max: 12m 50s | Hits: 388%/261   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  5m 27s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/8   | Total: 40m 20s | Avg:  5m 02s | Max: 18m 42s
      🟩 GCC                Pass: 100%/8   | Total: 38m 50s | Avg:  4m 51s | Max: 19m 09s
      🟩 MSVC               Pass: 100%/2   | Total: 23m 58s | Avg: 11m 59s | Max: 12m 50s | Hits: 388%/522   
      🟩 NVHPC              Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  5m 27s
    🟩 gpu
      🟩 v100               Pass: 100%/20  | Total:  1h 53m | Avg:  5m 41s | Max: 19m 09s | Hits: 388%/522   
    🟩 jobs
      🟩 Build              Pass: 100%/18  | Total:  1h 16m | Avg:  4m 13s | Max: 12m 50s | Hits: 388%/522   
      🟩 Test               Pass: 100%/2   | Total: 37m 51s | Avg: 18m 55s | Max: 19m 09s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  2m 46s | Avg:  2m 46s | Max:  2m 46s
      🟩 90a                Pass: 100%/1   | Total:  2m 37s | Avg:  2m 37s | Max:  2m 37s
    🟩 std
      🟩 17                 Pass: 100%/4   | Total: 13m 25s | Avg:  3m 21s | Max:  5m 27s
      🟩 20                 Pass: 100%/16  | Total:  1h 40m | Avg:  6m 17s | Max: 19m 09s | Hits: 388%/522   
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 39s | Avg: 4m 49s | Max: 7m 36s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 39s | Avg:  4m 49s | Max:  7m 36s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 39s | Avg:  4m 49s | Max:  7m 36s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 39s | Avg:  4m 49s | Max:  7m 36s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 39s | Avg:  4m 49s | Max:  7m 36s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 39s | Avg:  4m 49s | Max:  7m 36s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 39s | Avg:  4m 49s | Max:  7m 36s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 39s | Avg:  4m 49s | Max:  7m 36s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 03s | Avg:  2m 03s | Max:  2m 03s
      🟩 Test               Pass: 100%/1   | Total:  7m 36s | Avg:  7m 36s | Max:  7m 36s
    
  • 🟩 python: Pass: 100%/1 | Total: 1h 01m | Avg: 1h 01m | Max: 1h 01m

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
+/- libcu++
CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 144)

# Runner
98 linux-amd64-cpu16
19 linux-amd64-gpu-v100-latest-1
16 windows-amd64-cpu16
10 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

Contributor

🟩 CI finished in 1h 30m: Pass: 100%/144 | Total: 1d 03h | Avg: 11m 16s | Max: 1h 22m | Hits: 544%/25812
  • 🟩 libcudacxx: Pass: 100%/46 | Total: 8h 43m | Avg: 11m 22s | Max: 36m 38s | Hits: 682%/12570

    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total:  8h 36m | Avg: 11m 43s | Max: 36m 38s | Hits: 682%/12570 
      🟩 arm64              Pass: 100%/2   | Total:  7m 00s | Avg:  3m 30s | Max:  3m 37s
    🟩 ctk
      🟩 12.0               Pass: 100%/8   | Total:  1h 09m | Avg:  8m 43s | Max: 20m 57s | Hits: 682%/4906  
      🟩 12.5               Pass: 100%/2   | Total: 21m 50s | Avg: 10m 55s | Max: 13m 43s
      🟩 12.6               Pass: 100%/36  | Total:  7h 11m | Avg: 11m 59s | Max: 36m 38s | Hits: 682%/7664  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  1h 07m | Avg: 16m 48s | Max: 21m 37s
      🟩 nvcc12.0           Pass: 100%/8   | Total:  1h 09m | Avg:  8m 43s | Max: 20m 57s | Hits: 682%/4906  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 21m 50s | Avg: 10m 55s | Max: 13m 43s
      🟩 nvcc12.6           Pass: 100%/32  | Total:  6h 04m | Avg: 11m 23s | Max: 36m 38s | Hits: 682%/7664  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  1h 07m | Avg: 16m 48s | Max: 21m 37s
      🟩 nvcc               Pass: 100%/42  | Total:  7h 35m | Avg: 10m 51s | Max: 36m 38s | Hits: 682%/12570 
    🟩 cxx
      🟩 Clang14            Pass: 100%/6   | Total: 47m 51s | Avg:  7m 58s | Max: 17m 09s
      🟩 Clang15            Pass: 100%/1   | Total:  7m 23s | Avg:  7m 23s | Max:  7m 23s
      🟩 Clang16            Pass: 100%/1   | Total:  4m 23s | Avg:  4m 23s | Max:  4m 23s
      🟩 Clang17            Pass: 100%/1   | Total:  4m 01s | Avg:  4m 01s | Max:  4m 01s
      🟩 Clang18            Pass: 100%/8   | Total:  1h 53m | Avg: 14m 14s | Max: 30m 42s
      🟩 GCC7               Pass: 100%/5   | Total: 16m 46s | Avg:  3m 21s | Max:  3m 40s
      🟩 GCC8               Pass: 100%/1   | Total:  3m 49s | Avg:  3m 49s | Max:  3m 49s
      🟩 GCC9               Pass: 100%/3   | Total:  9m 51s | Avg:  3m 17s | Max:  3m 43s
      🟩 GCC10              Pass: 100%/1   | Total:  3m 36s | Avg:  3m 36s | Max:  3m 36s
      🟩 GCC11              Pass: 100%/1   | Total:  4m 03s | Avg:  4m 03s | Max:  4m 03s
      🟩 GCC12              Pass: 100%/1   | Total:  4m 10s | Avg:  4m 10s | Max:  4m 10s
      🟩 GCC13              Pass: 100%/10  | Total:  2h 40m | Avg: 16m 05s | Max: 36m 38s
      🟩 MSVC14.29          Pass: 100%/3   | Total:  1h 07m | Avg: 22m 21s | Max: 25m 48s | Hits: 682%/7410  
      🟩 MSVC14.39          Pass: 100%/2   | Total: 53m 39s | Avg: 26m 49s | Max: 26m 54s | Hits: 682%/5160  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 21m 50s | Avg: 10m 55s | Max: 13m 43s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/17  | Total:  2h 57m | Avg: 10m 26s | Max: 30m 42s
      🟩 GCC                Pass: 100%/22  | Total:  3h 23m | Avg:  9m 14s | Max: 36m 38s
      🟩 MSVC               Pass: 100%/5   | Total:  2h 00m | Avg: 24m 08s | Max: 26m 54s | Hits: 682%/12570 
      🟩 NVHPC              Pass: 100%/2   | Total: 21m 50s | Avg: 10m 55s | Max: 13m 43s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total:  8h 43m | Avg: 11m 22s | Max: 36m 38s | Hits: 682%/12570 
    🟩 jobs
      🟩 Build              Pass: 100%/39  | Total:  5h 46m | Avg:  8m 53s | Max: 26m 54s | Hits: 682%/12570 
      🟩 NVRTC              Pass: 100%/4   | Total:  2h 06m | Avg: 31m 37s | Max: 36m 38s
      🟩 Test               Pass: 100%/2   | Total: 48m 04s | Avg: 24m 02s | Max: 30m 42s
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  1m 54s | Avg:  1m 54s | Max:  1m 54s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total: 13m 37s | Avg: 13m 37s | Max: 13m 37s
      🟩 90a                Pass: 100%/2   | Total: 18m 08s | Avg:  9m 04s | Max: 14m 16s
    🟩 std
      🟩 11                 Pass: 100%/6   | Total: 56m 51s | Avg:  9m 28s | Max: 32m 09s
      🟩 14                 Pass: 100%/4   | Total:  1h 17m | Avg: 19m 15s | Max: 35m 54s | Hits: 682%/2412  
      🟩 17                 Pass: 100%/14  | Total:  2h 31m | Avg: 10m 47s | Max: 26m 54s | Hits: 682%/7502  
      🟩 20                 Pass: 100%/21  | Total:  3h 56m | Avg: 11m 15s | Max: 36m 38s | Hits: 681%/2656  
    
  • 🟩 cub: Pass: 100%/38 | Total: 8h 34m | Avg: 13m 33s | Max: 1h 22m | Hits: 539%/3540

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total:  8h 25m | Avg: 14m 01s | Max:  1h 22m | Hits: 539%/3540  
      🟩 arm64              Pass: 100%/2   | Total:  9m 53s | Avg:  4m 56s | Max:  5m 07s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total: 46m 01s | Avg:  9m 12s | Max: 25m 30s | Hits: 539%/885   
      🟩 12.5               Pass: 100%/2   | Total: 19m 31s | Avg:  9m 45s | Max:  9m 57s
      🟩 12.6               Pass: 100%/31  | Total:  7h 29m | Avg: 14m 29s | Max:  1h 22m | Hits: 539%/2655  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  9m 16s | Avg:  4m 38s | Max:  4m 43s
      🟩 nvcc12.0           Pass: 100%/5   | Total: 46m 01s | Avg:  9m 12s | Max: 25m 30s | Hits: 539%/885   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 19m 31s | Avg:  9m 45s | Max:  9m 57s
      🟩 nvcc12.6           Pass: 100%/29  | Total:  7h 20m | Avg: 15m 10s | Max:  1h 22m | Hits: 539%/2655  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  9m 16s | Avg:  4m 38s | Max:  4m 43s
      🟩 nvcc               Pass: 100%/36  | Total:  8h 25m | Avg: 14m 02s | Max:  1h 22m | Hits: 539%/3540  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 20m 51s | Avg:  5m 12s | Max:  5m 55s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 33s | Avg:  5m 33s | Max:  5m 33s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 53s | Avg:  5m 53s | Max:  5m 53s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 29s | Avg:  5m 29s | Max:  5m 29s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 38m | Avg: 14m 00s | Max: 48m 23s
      🟩 GCC7               Pass: 100%/2   | Total: 10m 36s | Avg:  5m 18s | Max:  5m 23s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 20s | Avg:  5m 20s | Max:  5m 20s
      🟩 GCC9               Pass: 100%/2   | Total: 11m 17s | Avg:  5m 38s | Max:  5m 46s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 59s | Avg:  5m 59s | Max:  5m 59s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 51s | Avg:  5m 51s | Max:  5m 51s
      🟩 GCC12              Pass: 100%/3   | Total: 29m 26s | Avg:  9m 48s | Max: 19m 18s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 01m | Avg: 22m 44s | Max:  1h 22m
      🟩 MSVC14.29          Pass: 100%/2   | Total: 52m 15s | Avg: 26m 07s | Max: 26m 45s | Hits: 539%/1770  
      🟩 MSVC14.39          Pass: 100%/2   | Total: 56m 57s | Avg: 28m 28s | Max: 29m 40s | Hits: 539%/1770  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 19m 31s | Avg:  9m 45s | Max:  9m 57s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  2h 15m | Avg:  9m 42s | Max: 48m 23s
      🟩 GCC                Pass: 100%/18  | Total:  4h 10m | Avg: 13m 54s | Max:  1h 22m
      🟩 MSVC               Pass: 100%/4   | Total:  1h 49m | Avg: 27m 18s | Max: 29m 40s | Hits: 539%/3540  
      🟩 NVHPC              Pass: 100%/2   | Total: 19m 31s | Avg:  9m 45s | Max:  9m 57s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 23m 42s | Avg: 11m 51s | Max: 19m 18s
      🟩 v100               Pass: 100%/36  | Total:  8h 11m | Avg: 13m 38s | Max:  1h 22m | Hits: 539%/3540  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  4h 21m | Avg:  8m 26s | Max: 29m 40s | Hits: 539%/3540  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 21m 05s | Avg: 21m 05s | Max: 21m 05s
      🟩 GraphCapture       Pass: 100%/1   | Total:  1h 22m | Avg:  1h 22m | Max:  1h 22m
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 31m | Avg: 30m 35s | Max: 48m 23s
      🟩 TestGPU            Pass: 100%/2   | Total: 57m 50s | Avg: 28m 55s | Max: 33m 57s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 23m 42s | Avg: 11m 51s | Max: 19m 18s
      🟩 90a                Pass: 100%/1   | Total:  4m 02s | Avg:  4m 02s | Max:  4m 02s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  2h 23m | Avg: 10m 16s | Max: 27m 17s | Hits: 539%/2655  
      🟩 20                 Pass: 100%/24  | Total:  6h 11m | Avg: 15m 28s | Max:  1h 22m | Hits: 539%/885   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 7h 05m | Avg: 11m 29s | Max: 33m 39s | Hits: 365%/9180

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 36m 45s | Avg: 18m 22s | Max: 30m 19s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total:  6h 55m | Avg: 11m 52s | Max: 33m 39s | Hits: 365%/9180  
      🟩 arm64              Pass: 100%/2   | Total:  9m 39s | Avg:  4m 49s | Max:  5m 00s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total: 56m 09s | Avg: 11m 13s | Max: 28m 45s | Hits: 365%/1836  
      🟩 12.5               Pass: 100%/2   | Total: 27m 32s | Avg: 13m 46s | Max: 13m 59s
      🟩 12.6               Pass: 100%/30  | Total:  5h 41m | Avg: 11m 22s | Max: 33m 39s | Hits: 365%/7344  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 16s | Avg:  5m 08s | Max:  5m 12s
      🟩 nvcc12.0           Pass: 100%/5   | Total: 56m 09s | Avg: 11m 13s | Max: 28m 45s | Hits: 365%/1836  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 27m 32s | Avg: 13m 46s | Max: 13m 59s
      🟩 nvcc12.6           Pass: 100%/28  | Total:  5h 31m | Avg: 11m 49s | Max: 33m 39s | Hits: 365%/7344  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 16s | Avg:  5m 08s | Max:  5m 12s
      🟩 nvcc               Pass: 100%/35  | Total:  6h 54m | Avg: 11m 51s | Max: 33m 39s | Hits: 365%/9180  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 21m 48s | Avg:  5m 27s | Max:  6m 00s
      🟩 Clang15            Pass: 100%/1   | Total:  5m 28s | Avg:  5m 28s | Max:  5m 28s
      🟩 Clang16            Pass: 100%/1   | Total:  5m 13s | Avg:  5m 13s | Max:  5m 13s
      🟩 Clang17            Pass: 100%/1   | Total:  5m 41s | Avg:  5m 41s | Max:  5m 41s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 07m | Avg:  9m 36s | Max: 33m 03s
      🟩 GCC7               Pass: 100%/2   | Total: 10m 20s | Avg:  5m 10s | Max:  5m 12s
      🟩 GCC8               Pass: 100%/1   | Total:  5m 29s | Avg:  5m 29s | Max:  5m 29s
      🟩 GCC9               Pass: 100%/2   | Total: 17m 36s | Avg:  8m 48s | Max: 11m 56s
      🟩 GCC10              Pass: 100%/1   | Total:  5m 16s | Avg:  5m 16s | Max:  5m 16s
      🟩 GCC11              Pass: 100%/1   | Total:  5m 50s | Avg:  5m 50s | Max:  5m 50s
      🟩 GCC12              Pass: 100%/1   | Total:  6m 18s | Avg:  6m 18s | Max:  6m 18s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 24m | Avg: 10m 36s | Max: 30m 19s
      🟩 MSVC14.29          Pass: 100%/2   | Total: 58m 00s | Avg: 29m 00s | Max: 29m 15s | Hits: 365%/3672  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 38m | Avg: 32m 47s | Max: 33m 39s | Hits: 365%/5508  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 27m 32s | Avg: 13m 46s | Max: 13m 59s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  1h 45m | Avg:  7m 31s | Max: 33m 03s
      🟩 GCC                Pass: 100%/16  | Total:  2h 15m | Avg:  8m 28s | Max: 30m 19s
      🟩 MSVC               Pass: 100%/5   | Total:  2h 36m | Avg: 31m 16s | Max: 33m 39s | Hits: 365%/9180  
      🟩 NVHPC              Pass: 100%/2   | Total: 27m 32s | Avg: 13m 46s | Max: 13m 59s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total:  7h 05m | Avg: 11m 29s | Max: 33m 39s | Hits: 365%/9180  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  4h 52m | Avg:  9m 26s | Max: 32m 51s | Hits: 365%/7344  
      🟩 TestCPU            Pass: 100%/3   | Total: 49m 17s | Avg: 16m 25s | Max: 33m 39s | Hits: 365%/1836  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 23m | Avg: 27m 41s | Max: 33m 03s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total:  4m 16s | Avg:  4m 16s | Max:  4m 16s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  2h 44m | Avg: 11m 45s | Max: 31m 52s | Hits: 365%/5508  
      🟩 20                 Pass: 100%/21  | Total:  3h 43m | Avg: 10m 39s | Max: 33m 39s | Hits: 365%/3672  
    
  • 🟩 cudax: Pass: 100%/20 | Total: 1h 48m | Avg: 5m 25s | Max: 20m 18s | Hits: 388%/522

    🟩 cpu
      🟩 amd64              Pass: 100%/16  | Total:  1h 38m | Avg:  6m 08s | Max: 20m 18s | Hits: 388%/522   
      🟩 arm64              Pass: 100%/4   | Total: 10m 14s | Avg:  2m 33s | Max:  2m 34s
    🟩 ctk
      🟩 12.0               Pass: 100%/1   | Total: 10m 53s | Avg: 10m 53s | Max: 10m 53s | Hits: 388%/261   
      🟩 12.5               Pass: 100%/2   | Total: 10m 48s | Avg:  5m 24s | Max:  5m 26s
      🟩 12.6               Pass: 100%/17  | Total:  1h 26m | Avg:  5m 06s | Max: 20m 18s | Hits: 388%/261   
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/1   | Total: 10m 53s | Avg: 10m 53s | Max: 10m 53s | Hits: 388%/261   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 10m 48s | Avg:  5m 24s | Max:  5m 26s
      🟩 nvcc12.6           Pass: 100%/17  | Total:  1h 26m | Avg:  5m 06s | Max: 20m 18s | Hits: 388%/261   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/20  | Total:  1h 48m | Avg:  5m 25s | Max: 20m 18s | Hits: 388%/522   
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  3m 16s | Avg:  3m 16s | Max:  3m 16s
      🟩 Clang15            Pass: 100%/1   | Total:  3m 19s | Avg:  3m 19s | Max:  3m 19s
      🟩 Clang16            Pass: 100%/1   | Total:  3m 09s | Avg:  3m 09s | Max:  3m 09s
      🟩 Clang17            Pass: 100%/1   | Total:  3m 06s | Avg:  3m 06s | Max:  3m 06s
      🟩 Clang18            Pass: 100%/4   | Total: 28m 30s | Avg:  7m 07s | Max: 20m 18s
      🟩 GCC10              Pass: 100%/1   | Total:  3m 01s | Avg:  3m 01s | Max:  3m 01s
      🟩 GCC11              Pass: 100%/1   | Total:  3m 13s | Avg:  3m 13s | Max:  3m 13s
      🟩 GCC12              Pass: 100%/2   | Total: 17m 47s | Avg:  8m 53s | Max: 14m 38s
      🟩 GCC13              Pass: 100%/4   | Total: 10m 18s | Avg:  2m 34s | Max:  2m 36s
      🟩 MSVC14.36          Pass: 100%/1   | Total: 10m 53s | Avg: 10m 53s | Max: 10m 53s | Hits: 388%/261   
      🟩 MSVC14.39          Pass: 100%/1   | Total: 11m 03s | Avg: 11m 03s | Max: 11m 03s | Hits: 388%/261   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 10m 48s | Avg:  5m 24s | Max:  5m 26s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/8   | Total: 41m 20s | Avg:  5m 10s | Max: 20m 18s
      🟩 GCC                Pass: 100%/8   | Total: 34m 19s | Avg:  4m 17s | Max: 14m 38s
      🟩 MSVC               Pass: 100%/2   | Total: 21m 56s | Avg: 10m 58s | Max: 11m 03s | Hits: 388%/522   
      🟩 NVHPC              Pass: 100%/2   | Total: 10m 48s | Avg:  5m 24s | Max:  5m 26s
    🟩 gpu
      🟩 v100               Pass: 100%/20  | Total:  1h 48m | Avg:  5m 25s | Max: 20m 18s | Hits: 388%/522   
    🟩 jobs
      🟩 Build              Pass: 100%/18  | Total:  1h 13m | Avg:  4m 04s | Max: 11m 03s | Hits: 388%/522   
      🟩 Test               Pass: 100%/2   | Total: 34m 56s | Avg: 17m 28s | Max: 20m 18s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  2m 35s | Avg:  2m 35s | Max:  2m 35s
      🟩 90a                Pass: 100%/1   | Total:  2m 36s | Avg:  2m 36s | Max:  2m 36s
    🟩 std
      🟩 17                 Pass: 100%/4   | Total: 13m 09s | Avg:  3m 17s | Max:  5m 26s
      🟩 20                 Pass: 100%/16  | Total:  1h 35m | Avg:  5m 57s | Max: 20m 18s | Hits: 388%/522   
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 38s | Avg: 4m 49s | Max: 7m 30s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 38s | Avg:  4m 49s | Max:  7m 30s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 38s | Avg:  4m 49s | Max:  7m 30s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 38s | Avg:  4m 49s | Max:  7m 30s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 38s | Avg:  4m 49s | Max:  7m 30s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 38s | Avg:  4m 49s | Max:  7m 30s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 38s | Avg:  4m 49s | Max:  7m 30s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 38s | Avg:  4m 49s | Max:  7m 30s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 08s | Avg:  2m 08s | Max:  2m 08s
      🟩 Test               Pass: 100%/1   | Total:  7m 30s | Avg:  7m 30s | Max:  7m 30s
    
  • 🟩 python: Pass: 100%/1 | Total: 42m 46s | Avg: 42m 46s | Max: 42m 46s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 42m 46s | Avg: 42m 46s | Max: 42m 46s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 42m 46s | Avg: 42m 46s | Max: 42m 46s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 42m 46s | Avg: 42m 46s | Max: 42m 46s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 42m 46s | Avg: 42m 46s | Max: 42m 46s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 42m 46s | Avg: 42m 46s | Max: 42m 46s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 42m 46s | Avg: 42m 46s | Max: 42m 46s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 42m 46s | Avg: 42m 46s | Max: 42m 46s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 42m 46s | Avg: 42m 46s | Max: 42m 46s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
+/- libcu++
CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 144)

Count  Runner
98     linux-amd64-cpu16
19     linux-amd64-gpu-v100-latest-1
16     windows-amd64-cpu16
10     linux-arm64-cpu16
1      linux-amd64-gpu-h100-latest-1-testing

@bernhardmgruber bernhardmgruber merged commit 4812f28 into NVIDIA:main Jan 21, 2025
156 of 159 checks passed
@bernhardmgruber bernhardmgruber deleted the half_limits branch January 21, 2025 15:39
bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this pull request Jan 22, 2025
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
update docs

update docs

add `memcmp`, `memmove` and `memchr` implementations

implement tests

Use cuda::std::min/max in Thrust (NVIDIA#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (NVIDIA#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (NVIDIA#2773)

Deprecate thrust::null_type (NVIDIA#3367)

Deprecate cub::DeviceSpmv (NVIDIA#3320)

Fixes: NVIDIA#896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (NVIDIA#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (NVIDIA#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (NVIDIA#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

Exit with error when RAPIDS CI fails. (NVIDIA#3385)

cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Deprecate thrust::async (NVIDIA#3324)

Fixes: NVIDIA#100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (NVIDIA#3314)

* add compiler-specific path
* fix device code path
* add _CCC_ASSUME

Deprecate thrust::numeric_limits (NVIDIA#3366)

Replace `typedef` with `using` in libcu++ (NVIDIA#3368)

Deprecate thrust::optional (NVIDIA#3307)

Fixes: NVIDIA#3306

Upgrade to Catch2 3.8  (NVIDIA#3310)

Fixes: NVIDIA#1724

refactor `<cuda/std/cstdint>` (NVIDIA#3325)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Update CODEOWNERS (NVIDIA#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (NVIDIA#3408)

Implement more cmath functions to be usable on host and device (NVIDIA#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (NVIDIA#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Fix assert definition for NVHPC due to constexpr issues (NVIDIA#3418)

NVHPC cannot decide at compile time where the code will run, so using _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition, which should also work on device.

Fixes NVIDIA#3411

Extend CUB reduce benchmarks (NVIDIA#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: NVIDIA#3283

Update upload-pages-artifact to v3 (NVIDIA#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Replace and deprecate thrust::cuda_cub::terminate (NVIDIA#3421)

`std::linalg` accessors and `transposed_layout` (NVIDIA#2962)

Add round up/down to multiple (NVIDIA#3234)

[FEA]: Introduce Python module with CCCL headers (NVIDIA#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment))

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* NVIDIA#3201 (comment)

* NVIDIA#3201 (comment)

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* NVIDIA#3201 (comment)

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d6.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)

For completeness: the other repo took so long to install into the pre-commit cache that it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a21.

Error message: NVIDIA#3201 (comment)

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd.

* Implement suggestion by @shwina (NVIDIA#3201 (review))

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (NVIDIA#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes NVIDIA#3404

Fix CI issues (NVIDIA#3443)

Remove deprecated `cub::min` (NVIDIA#3450)

* Remove deprecated `cuda::{min,max}`

* Drop unused `thrust::remove_cvref` file

Fix typo in builtin (NVIDIA#3451)

Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435)

uses unsigned offset types in thrust's scan dispatch (NVIDIA#3436)

Default transform_iterator's copy ctor (NVIDIA#3395)

Fixes: NVIDIA#2393

Turn C++ dialect warning into error (NVIDIA#3453)

Uses unsigned offset types in thrust's sort algorithm calling into `DispatchMergeSort` (NVIDIA#3437)

* uses thrust's dynamic dispatch for merge_sort

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Refactor allocator handling of contiguous_storage (NVIDIA#3050)

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop thrust::detail::integer_traits (NVIDIA#3391)

Add cuda::is_floating_point supporting half and bfloat (NVIDIA#3379)

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Improve docs of std headers (NVIDIA#3416)

Drop C++11 and C++14 support for all of cccl (NVIDIA#3417)

* Drop C++11 and C++14 support for all of cccl

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Deprecate a few CUB macros (NVIDIA#3456)

Deprecate thrust universal iterator categories (NVIDIA#3461)

Fix launch args order (NVIDIA#3465)

Add `--extended-lambda` to the list of removed clangd flags (NVIDIA#3432)

add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429)

Add `_CCCL_BUILTIN_PREFETCH` (NVIDIA#3433)

Drop universal iterator categories (NVIDIA#3474)

Ensure that headers in `<cuda/*>` can be built with a C++-only compiler (NVIDIA#3472)

Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470)

Also ensure that we actually can enable FP8 due to FP16 and BF16 requirements

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Moves CUB kernel entry points to a detail namespace (NVIDIA#3468)

* moves emptykernel to detail ns

* second batch

* third batch

* fourth batch

* fixes cuda parallel

* concatenates nested namespaces

Deprecate block/warp algo specializations (NVIDIA#3455)

Fixes: NVIDIA#3409

Refactor CUB's util_debug (NVIDIA#3345)
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
miscco added a commit that referenced this pull request Jan 22, 2025
* add `_CCCL_HAS_NVFP8` macro (#3429)

* Add cuda::is_floating_point supporting half and bfloat (#3379)

Co-authored-by: Michael Schellenberger Costa <[email protected]>

* Specialize __is_extended_floating_point for FP8 types (#3470)

Also ensure that we actually can enable FP8 due to FP16 and BF16 requirements

Co-authored-by: Michael Schellenberger Costa <[email protected]>

---------

Co-authored-by: Federico Busato <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>
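The PR description states that the new `cuda::is_floating_point<T>` in `<cuda/type_traits>` defaults to `cuda::std::is_floating_point<T>` and adds specializations for half and bfloat. Below is a minimal host-only C++17 sketch of that pattern, not the libcu++ implementation. `sketch::is_floating_point`, `fake_half` and `fake_bfloat16` are hypothetical stand-ins so that it compiles without CUDA headers, and `std::` replaces `cuda::std::`. The sketch also strips cv-qualifiers before dispatching, which mirrors the standard trait. Whether the real header does the same is an assumption.

```cpp
#include <type_traits>

// Hypothetical stand-ins for __half and __nv_bfloat16, so this sketch
// builds with a plain host compiler (no <cuda_fp16.h> / <cuda_bf16.h>).
struct fake_half { unsigned short bits; };
struct fake_bfloat16 { unsigned short bits; };

namespace sketch {
namespace detail {
// Default: defer to the standard trait, which stays unchanged.
template <class T>
struct is_fp_impl : std::is_floating_point<T> {};

// Extra specializations for the 16-bit extended floating-point types.
template <>
struct is_fp_impl<fake_half> : std::true_type {};
template <>
struct is_fp_impl<fake_bfloat16> : std::true_type {};
} // namespace detail

// Assumption of this sketch: strip cv-qualifiers first, like std::is_floating_point.
template <class T>
struct is_floating_point : detail::is_fp_impl<std::remove_cv_t<T>> {};

template <class T>
inline constexpr bool is_floating_point_v = is_floating_point<T>::value;
} // namespace sketch
```

This design leaves `std::is_floating_point` (here standing in for `cuda::std::is_floating_point`) standard-conforming for `fake_half`. Only the separate trait reports it as floating point.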
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 23, 2025
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 23, 2025
Cleanup util_arch (NVIDIA#2773)

Improves `DeviceSegmentedSort` test run time for large number of items and segments (NVIDIA#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (NVIDIA#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Deprecate thrust::async (NVIDIA#3324)

Fixes: NVIDIA#100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342)

Deprecate thrust::numeric_limits (NVIDIA#3366)

Upgrade to Catch2 3.8  (NVIDIA#3310)

Fixes: NVIDIA#1724

Fix sign-compare warning (NVIDIA#3408)

Implement more cmath functions to be usable on host and device (NVIDIA#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (NVIDIA#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <[email protected]>

cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes NVIDIA#3404

Remove deprecated `cub::min` (NVIDIA#3450)

* Remove deprecated `cuda::{min,max}`

* Drop unused `thrust::remove_cvref` file

Fix typo in builtin (NVIDIA#3451)

Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435)

Drop thrust::detail::integer_traits (NVIDIA#3391)

Add cuda::is_floating_point supporting half and bfloat (NVIDIA#3379)

Co-authored-by: Michael Schellenberger Costa <[email protected]>

add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429)

Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470)

Also ensure that we actually can enable FP8 due to FP16 and BF16 requirements

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Moves CUB kernel entry points to a detail namespace (NVIDIA#3468)

* moves emptykernel to detail ns

* second batch

* third batch

* fourth batch

* fixes cuda parallel

* concatenates nested namespaces

Deprecate block/warp algo specializations (NVIDIA#3455)

Fixes: NVIDIA#3409

fix documentation
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 29, 2025
Labels
None yet
Projects
Archived in project
5 participants