Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` #2647

elstehle · 2024-10-29T13:09:06Z

This PR depends on #3148, which needs to be merged first.

Description

This PR adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}`.

For inputs with more than `INT_MAX` items, we split the input into partitions of up to `INT_MAX` items each. For each partition, we compute the ArgExtremum and then compare that partition's result to the ArgExtremum result obtained so far.

The fundamental idea here is that we do not need to carry the wide `OffsetT` holding the extremum's index everywhere, but only when we cross a partition boundary.
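As a rough illustration of the partitioning idea described above, here is a minimal host-side sketch in plain C++. It is not the CUB implementation: `partition_arg_min`, `streaming_arg_min`, and the `max_partition` parameter are hypothetical names introduced only for this example, and the per-partition step is a sequential loop rather than a device-wide reduction. It assumes `num_items > 0` and `0 < max_partition <= INT_MAX`.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: arg-min of a single partition using a narrow 32-bit
// local index. Ties resolve to the lowest index. Requires n > 0.
template <typename T>
std::pair<std::int32_t, T> partition_arg_min(const T* data, std::int32_t n)
{
  std::int32_t best = 0;
  for (std::int32_t i = 1; i < n; ++i)
  {
    if (data[i] < data[best])
    {
      best = i;
    }
  }
  return {best, data[best]};
}

// Hypothetical driver: walks the input partition by partition. The wide
// 64-bit offset is only applied when a partition's result is merged into
// the running result, i.e. at partition boundaries.
template <typename T>
std::pair<std::int64_t, T>
streaming_arg_min(const T* data, std::int64_t num_items, std::int64_t max_partition)
{
  std::int64_t best_index = 0;
  T best_value            = data[0];
  for (std::int64_t offset = 0; offset < num_items; offset += max_partition)
  {
    // Partition size fits in 32 bits because max_partition <= INT_MAX.
    const auto n     = static_cast<std::int32_t>(std::min(max_partition, num_items - offset));
    const auto local = partition_arg_min(data + offset, n);
    // Strict comparison keeps the earliest index when values tie across partitions.
    if (offset == 0 || local.second < best_value)
    {
      best_index = offset + local.first;
      best_value = local.second;
    }
  }
  return {best_index, best_value};
}
```

Passing a small `max_partition` (for example 3) exercises the cross-partition merge without needing more than `INT_MAX` items.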

Closes #2515

With the streaming approach, we limited the performance regression to 3.5% across all benchmarked workloads, with a worst-case slowdown of 0.5% for 2^28 items.

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

cub/cub/device/device_reduce.cuh

bernhardmgruber

I just quickly looked over the C++, not the algorithm. Here is some feedback:

Generally, you could make a lot of your local variables `const`, since their values won't change. This usually makes code faster to understand, at least for me.

cub/benchmarks/bench/reduce/arg_max.cu

cub/test/catch2_test_device_reduce_large_offsets.cu

cub/cub/device/dispatch/dispatch_streaming_reduce.cuh

elstehle · 2024-10-30T07:59:36Z

Temporarily switching back to a draft PR to identify the changes that degraded performance in the meantime 🙈

github-actions · 2024-10-30T20:22:51Z

🟨 CI finished in 2h 02m: Pass: 98%/222 | Total: 6d 02h | Avg: 39m 37s | Max: 1h 13m | Hits: 79%/16113

🟨 cub: Pass: 96%/110 | Total: 3d 20h | Avg: 50m 16s | Max: 1h 13m | Hits: 64%/2948

🔍 cpu: amd64 🔍
  🔍 amd64              Pass:  96%/102 | Total:  3d 12h | Avg: 49m 57s | Max:  1h 13m | Hits:  64%/2948  
  🟩 arm64              Pass: 100%/8   | Total:  7h 14m | Avg: 54m 17s | Max: 55m 29s
🔍 ctk: 12.6 🔍
  🟩 11.1               Pass: 100%/15  | Total: 11h 27m | Avg: 45m 51s | Max: 57m 30s | Hits:  64%/737   
  🟩 11.8               Pass: 100%/3   | Total:  3h 29m | Avg:  1h 09m | Max:  1h 13m
  🟩 12.5               Pass: 100%/4   | Total:  3h 59m | Avg: 59m 57s | Max:  1h 05m
  🔍 12.6               Pass:  95%/88  | Total:  3d 01h | Avg: 49m 55s | Max:  1h 01m | Hits:  64%/2211  
🚨 cudacxx: ClangCUDA18 🚨
  🔥 ClangCUDA18        Pass:   0%/4   | Total:  3h 41m | Avg: 55m 17s | Max: 57m 22s
  🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 27m | Avg: 45m 51s | Max: 57m 30s | Hits:  64%/737   
  🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 29m | Avg:  1h 09m | Max:  1h 13m
  🟩 nvcc12.5           Pass: 100%/4   | Total:  3h 59m | Avg: 59m 57s | Max:  1h 05m
  🟩 nvcc12.6           Pass: 100%/84  | Total:  2d 21h | Avg: 49m 40s | Max:  1h 01m | Hits:  64%/2211  
🚨 cudacxx_family: ClangCUDA 🚨
  🔥 ClangCUDA          Pass:   0%/4   | Total:  3h 41m | Avg: 55m 17s | Max: 57m 22s
  🟩 nvcc               Pass: 100%/106 | Total:  3d 16h | Avg: 50m 05s | Max:  1h 13m | Hits:  64%/2948  
🔍 cxx: Clang18 🔍
  🟩 Clang9             Pass: 100%/6   | Total:  5h 02m | Avg: 50m 28s | Max: 58m 12s
  🟩 Clang10            Pass: 100%/3   | Total:  2h 48m | Avg: 56m 06s | Max: 58m 52s
  🟩 Clang11            Pass: 100%/4   | Total:  3h 35m | Avg: 53m 54s | Max: 56m 59s
  🟩 Clang12            Pass: 100%/4   | Total:  3h 39m | Avg: 54m 50s | Max: 57m 15s
  🟩 Clang13            Pass: 100%/4   | Total:  3h 22m | Avg: 50m 32s | Max: 52m 29s
  🟩 Clang14            Pass: 100%/4   | Total:  3h 35m | Avg: 53m 49s | Max: 56m 26s
  🟩 Clang15            Pass: 100%/4   | Total:  3h 33m | Avg: 53m 26s | Max: 56m 18s
  🟩 Clang16            Pass: 100%/4   | Total:  3h 33m | Avg: 53m 24s | Max: 56m 53s
  🟩 Clang17            Pass: 100%/4   | Total:  3h 33m | Avg: 53m 29s | Max: 55m 48s
  🔍 Clang18            Pass:  63%/11  | Total:  9h 06m | Avg: 49m 38s | Max: 58m 07s
  🟩 GCC6               Pass: 100%/2   | Total:  1h 28m | Avg: 44m 12s | Max: 45m 05s
  🟩 GCC7               Pass: 100%/6   | Total:  4h 50m | Avg: 48m 20s | Max: 55m 05s
  🟩 GCC8               Pass: 100%/6   | Total:  4h 54m | Avg: 49m 01s | Max: 56m 18s
  🟩 GCC9               Pass: 100%/6   | Total:  4h 56m | Avg: 49m 28s | Max: 57m 31s
  🟩 GCC10              Pass: 100%/4   | Total:  3h 40m | Avg: 55m 00s | Max: 57m 08s
  🟩 GCC11              Pass: 100%/7   | Total:  7h 00m | Avg:  1h 00m | Max:  1h 13m
  🟩 GCC12              Pass: 100%/4   | Total:  3h 41m | Avg: 55m 15s | Max: 57m 05s
  🟩 GCC13              Pass: 100%/16  | Total:  9h 02m | Avg: 33m 53s | Max: 57m 03s
  🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 53m | Avg: 57m 55s | Max:  1h 00m
  🟩 MSVC14.16          Pass: 100%/1   | Total: 57m 30s | Avg: 57m 30s | Max: 57m 30s | Hits:  64%/737   
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 53m | Avg: 56m 44s | Max: 57m 56s | Hits:  64%/1474  
  🟩 MSVC14.39          Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m | Hits:  64%/737   
  🟩 NVHPC24.7          Pass: 100%/4   | Total:  3h 59m | Avg: 59m 57s | Max:  1h 05m
🔍 cxx_family: Clang 🔍
  🔍 Clang              Pass:  91%/48  | Total:  1d 17h | Avg: 52m 18s | Max: 58m 52s
  🟩 GCC                Pass: 100%/51  | Total:  1d 15h | Avg: 46m 31s | Max:  1h 13m
  🟩 Intel              Pass: 100%/3   | Total:  2h 53m | Avg: 57m 55s | Max:  1h 00m
  🟩 MSVC               Pass: 100%/4   | Total:  3h 52m | Avg: 58m 11s | Max:  1h 01m | Hits:  64%/2948  
  🟩 NVHPC              Pass: 100%/4   | Total:  3h 59m | Avg: 59m 57s | Max:  1h 05m
🔍 jobs: Build 🔍
  🔍 Build              Pass:  96%/102 | Total:  3d 17h | Avg: 52m 30s | Max:  1h 13m | Hits:  64%/2948  
  🟩 DeviceLaunch       Pass: 100%/1   | Total: 17m 29s | Avg: 17m 29s | Max: 17m 29s
  🟩 GraphCapture       Pass: 100%/1   | Total: 19m 38s | Avg: 19m 38s | Max: 19m 38s
  🟩 HostLaunch         Pass: 100%/3   | Total: 54m 45s | Avg: 18m 15s | Max: 20m 01s
  🟩 TestGPU            Pass: 100%/3   | Total:  1h 23m | Avg: 27m 40s | Max: 30m 05s
🟨 gpu
  🟨 v100               Pass:  96%/110 | Total:  3d 20h | Avg: 50m 16s | Max:  1h 13m | Hits:  64%/2948  
🟩 sm
  🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 29m | Avg:  1h 09m | Max:  1h 13m
  🟩 90a                Pass: 100%/4   | Total:  1h 31m | Avg: 22m 51s | Max: 23m 49s
🟨 std
  🟨 11                 Pass:  96%/30  | Total:  1d 00h | Avg: 49m 34s | Max:  1h 13m
  🟨 14                 Pass:  96%/29  | Total:  1d 01h | Avg: 53m 12s | Max:  1h 09m | Hits:  64%/1474  
  🟨 17                 Pass:  96%/27  | Total: 23h 46m | Avg: 52m 49s | Max:  1h 06m | Hits:  64%/737   
  🟨 20                 Pass:  95%/24  | Total: 17h 53m | Avg: 44m 44s | Max:  1h 01m | Hits:  64%/737

🟩 thrust: Pass: 100%/109 | Total: 2d 06h | Avg: 29m 43s | Max: 58m 06s | Hits: 83%/13165

🟩 cpu
  🟩 amd64              Pass: 100%/101 | Total:  2d 02h | Avg: 29m 53s | Max: 58m 06s | Hits:  83%/13165 
  🟩 arm64              Pass: 100%/8   | Total:  3h 40m | Avg: 27m 34s | Max: 30m 36s
🟩 ctk
  🟩 11.1               Pass: 100%/15  | Total:  7h 12m | Avg: 28m 49s | Max: 52m 43s | Hits:  79%/2633  
  🟩 11.8               Pass: 100%/3   | Total:  2h 00m | Avg: 40m 02s | Max: 44m 54s
  🟩 12.5               Pass: 100%/4   | Total:  3h 19m | Avg: 49m 47s | Max: 57m 23s
  🟩 12.6               Pass: 100%/87  | Total:  1d 17h | Avg: 28m 36s | Max: 58m 06s | Hits:  84%/10532 
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/4   | Total:  1h 35m | Avg: 23m 57s | Max: 27m 02s
  🟩 nvcc11.1           Pass: 100%/15  | Total:  7h 12m | Avg: 28m 49s | Max: 52m 43s | Hits:  79%/2633  
  🟩 nvcc11.8           Pass: 100%/3   | Total:  2h 00m | Avg: 40m 02s | Max: 44m 54s
  🟩 nvcc12.5           Pass: 100%/4   | Total:  3h 19m | Avg: 49m 47s | Max: 57m 23s
  🟩 nvcc12.6           Pass: 100%/83  | Total:  1d 15h | Avg: 28m 49s | Max: 58m 06s | Hits:  84%/10532 
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/4   | Total:  1h 35m | Avg: 23m 57s | Max: 27m 02s
  🟩 nvcc               Pass: 100%/105 | Total:  2d 04h | Avg: 29m 56s | Max: 58m 06s | Hits:  83%/13165 
🟩 cxx
  🟩 Clang9             Pass: 100%/6   | Total:  2h 57m | Avg: 29m 37s | Max: 36m 27s
  🟩 Clang10            Pass: 100%/3   | Total:  1h 32m | Avg: 30m 51s | Max: 35m 21s
  🟩 Clang11            Pass: 100%/4   | Total:  1h 57m | Avg: 29m 15s | Max: 33m 00s
  🟩 Clang12            Pass: 100%/4   | Total:  1h 58m | Avg: 29m 33s | Max: 34m 19s
  🟩 Clang13            Pass: 100%/4   | Total:  1h 56m | Avg: 29m 05s | Max: 31m 41s
  🟩 Clang14            Pass: 100%/4   | Total:  2h 00m | Avg: 30m 09s | Max: 33m 28s
  🟩 Clang15            Pass: 100%/4   | Total:  1h 57m | Avg: 29m 28s | Max: 33m 55s
  🟩 Clang16            Pass: 100%/4   | Total:  1h 58m | Avg: 29m 34s | Max: 31m 40s
  🟩 Clang17            Pass: 100%/4   | Total:  2h 00m | Avg: 30m 13s | Max: 34m 08s
  🟩 Clang18            Pass: 100%/11  | Total:  4h 17m | Avg: 23m 25s | Max: 34m 40s
  🟩 GCC6               Pass: 100%/2   | Total: 52m 50s | Avg: 26m 25s | Max: 28m 36s
  🟩 GCC7               Pass: 100%/6   | Total:  2h 44m | Avg: 27m 26s | Max: 32m 03s
  🟩 GCC8               Pass: 100%/6   | Total:  2h 45m | Avg: 27m 32s | Max: 29m 31s
  🟩 GCC9               Pass: 100%/6   | Total:  2h 48m | Avg: 28m 09s | Max: 31m 24s
  🟩 GCC10              Pass: 100%/4   | Total:  1h 58m | Avg: 29m 42s | Max: 32m 14s
  🟩 GCC11              Pass: 100%/7   | Total:  4h 02m | Avg: 34m 35s | Max: 44m 54s
  🟩 GCC12              Pass: 100%/4   | Total:  2h 09m | Avg: 32m 15s | Max: 37m 39s
  🟩 GCC13              Pass: 100%/14  | Total:  4h 58m | Avg: 21m 20s | Max: 36m 31s
  🟩 Intel2023.2.0      Pass: 100%/3   | Total:  1h 44m | Avg: 34m 57s | Max: 37m 57s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 52m 43s | Avg: 52m 43s | Max: 52m 43s | Hits:  79%/2633  
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 46m | Avg: 53m 24s | Max: 54m 06s | Hits:  79%/5266  
  🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 18m | Avg: 39m 27s | Max: 58m 06s | Hits:  89%/5266  
  🟩 NVHPC24.7          Pass: 100%/4   | Total:  3h 19m | Avg: 49m 47s | Max: 57m 23s
🟩 cxx_family
  🟩 Clang              Pass: 100%/48  | Total: 22h 37m | Avg: 28m 16s | Max: 36m 27s
  🟩 GCC                Pass: 100%/49  | Total: 22h 20m | Avg: 27m 21s | Max: 44m 54s
  🟩 Intel              Pass: 100%/3   | Total:  1h 44m | Avg: 34m 57s | Max: 37m 57s
  🟩 MSVC               Pass: 100%/5   | Total:  3h 58m | Avg: 47m 41s | Max: 58m 06s | Hits:  83%/13165 
  🟩 NVHPC              Pass: 100%/4   | Total:  3h 19m | Avg: 49m 47s | Max: 57m 23s
🟩 gpu
  🟩 v100               Pass: 100%/109 | Total:  2d 06h | Avg: 29m 43s | Max: 58m 06s | Hits:  83%/13165 
🟩 jobs
  🟩 Build              Pass: 100%/102 | Total:  2d 04h | Avg: 30m 57s | Max: 58m 06s | Hits:  79%/10532 
  🟩 TestCPU            Pass: 100%/4   | Total: 43m 21s | Avg: 10m 50s | Max: 20m 49s | Hits:  99%/2633  
  🟩 TestGPU            Pass: 100%/3   | Total: 39m 41s | Avg: 13m 13s | Max: 14m 07s
🟩 sm
  🟩 60;70;80;90        Pass: 100%/3   | Total:  2h 00m | Avg: 40m 02s | Max: 44m 54s
  🟩 90a                Pass: 100%/4   | Total:  1h 20m | Avg: 20m 05s | Max: 26m 12s
🟩 std
  🟩 11                 Pass: 100%/30  | Total: 12h 03m | Avg: 24m 07s | Max: 40m 24s
  🟩 14                 Pass: 100%/29  | Total: 15h 40m | Avg: 32m 25s | Max: 54m 06s | Hits:  79%/5266  
  🟩 17                 Pass: 100%/27  | Total: 14h 51m | Avg: 33m 00s | Max: 57m 23s | Hits:  79%/2633  
  🟩 20                 Pass: 100%/23  | Total: 11h 24m | Avg: 29m 46s | Max: 58m 06s | Hits:  89%/5266

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 25s | Avg: 4m 42s | Max: 7m 08s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total:  9m 25s | Avg:  4m 42s | Max:  7m 08s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total:  9m 25s | Avg:  4m 42s | Max:  7m 08s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 25s | Avg:  4m 42s | Max:  7m 08s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total:  9m 25s | Avg:  4m 42s | Max:  7m 08s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total:  9m 25s | Avg:  4m 42s | Max:  7m 08s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total:  9m 25s | Avg:  4m 42s | Max:  7m 08s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total:  9m 25s | Avg:  4m 42s | Max:  7m 08s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 17s | Avg:  2m 17s | Max:  2m 17s
  🟩 Test               Pass: 100%/1   | Total:  7m 08s | Avg:  7m 08s | Max:  7m 08s

🟩 pycuda: Pass: 100%/1 | Total: 15m 26s | Avg: 15m 26s | Max: 15m 26s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 15m 26s | Avg: 15m 26s | Max: 15m 26s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 15m 26s | Avg: 15m 26s | Max: 15m 26s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 15m 26s | Avg: 15m 26s | Max: 15m 26s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 15m 26s | Avg: 15m 26s | Max: 15m 26s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 15m 26s | Avg: 15m 26s | Max: 15m 26s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 15m 26s | Avg: 15m 26s | Max: 15m 26s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 15m 26s | Avg: 15m 26s | Max: 15m 26s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 15m 26s | Avg: 15m 26s | Max: 15m 26s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
	Thrust
	CUDA Experimental
	pycuda
	CCCL C Parallel Library

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
+/-	pycuda
+/-	CCCL C Parallel Library

🏃‍ Runner counts (total jobs: 222)

#	Runner
184	`linux-amd64-cpu16`
16	`linux-arm64-cpu16`
13	`linux-amd64-gpu-v100-latest-1`
9	`windows-amd64-cpu16`

github-actions · 2024-10-31T06:15:45Z

🟩 CI finished in 1h 26m: Pass: 100%/222 | Total: 5d 23h | Avg: 38m 52s | Max: 1h 07m | Hits: 79%/16113

🟩 cub: Pass: 100%/110 | Total: 3d 18h | Avg: 49m 10s | Max: 1h 07m | Hits: 64%/2948

🟩 cpu
  🟩 amd64              Pass: 100%/102 | Total:  3d 11h | Avg: 48m 49s | Max:  1h 07m | Hits:  64%/2948  
  🟩 arm64              Pass: 100%/8   | Total:  7h 09m | Avg: 53m 39s | Max: 54m 26s
🟩 ctk
  🟩 11.1               Pass: 100%/15  | Total: 11h 18m | Avg: 45m 13s | Max: 49m 44s | Hits:  64%/737   
  🟩 11.8               Pass: 100%/3   | Total:  3h 16m | Avg:  1h 05m | Max:  1h 07m
  🟩 12.5               Pass: 100%/4   | Total:  4h 02m | Avg:  1h 00m | Max:  1h 05m
  🟩 12.6               Pass: 100%/88  | Total:  2d 23h | Avg: 48m 46s | Max: 59m 53s | Hits:  64%/2211  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/4   | Total:  3h 37m | Avg: 54m 27s | Max: 55m 20s
  🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 18m | Avg: 45m 13s | Max: 49m 44s | Hits:  64%/737   
  🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 16m | Avg:  1h 05m | Max:  1h 07m
  🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 02m | Avg:  1h 00m | Max:  1h 05m
  🟩 nvcc12.6           Pass: 100%/84  | Total:  2d 19h | Avg: 48m 29s | Max: 59m 53s | Hits:  64%/2211  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/4   | Total:  3h 37m | Avg: 54m 27s | Max: 55m 20s
  🟩 nvcc               Pass: 100%/106 | Total:  3d 14h | Avg: 48m 58s | Max:  1h 07m | Hits:  64%/2948  
🟩 cxx
  🟩 Clang9             Pass: 100%/6   | Total:  4h 52m | Avg: 48m 46s | Max: 52m 34s
  🟩 Clang10            Pass: 100%/3   | Total:  2h 35m | Avg: 51m 52s | Max: 53m 06s
  🟩 Clang11            Pass: 100%/4   | Total:  3h 24m | Avg: 51m 05s | Max: 52m 00s
  🟩 Clang12            Pass: 100%/4   | Total:  3h 33m | Avg: 53m 25s | Max: 57m 14s
  🟩 Clang13            Pass: 100%/4   | Total:  3h 32m | Avg: 53m 12s | Max: 56m 52s
  🟩 Clang14            Pass: 100%/4   | Total:  3h 29m | Avg: 52m 19s | Max: 55m 12s
  🟩 Clang15            Pass: 100%/4   | Total:  3h 37m | Avg: 54m 17s | Max: 57m 08s
  🟩 Clang16            Pass: 100%/4   | Total:  3h 24m | Avg: 51m 05s | Max: 52m 42s
  🟩 Clang17            Pass: 100%/4   | Total:  3h 24m | Avg: 51m 10s | Max: 54m 41s
  🟩 Clang18            Pass: 100%/11  | Total:  8h 44m | Avg: 47m 41s | Max: 55m 20s
  🟩 GCC6               Pass: 100%/2   | Total:  1h 29m | Avg: 44m 50s | Max: 45m 48s
  🟩 GCC7               Pass: 100%/6   | Total:  4h 54m | Avg: 49m 04s | Max: 55m 14s
  🟩 GCC8               Pass: 100%/6   | Total:  4h 44m | Avg: 47m 27s | Max: 55m 33s
  🟩 GCC9               Pass: 100%/6   | Total:  4h 52m | Avg: 48m 40s | Max: 54m 33s
  🟩 GCC10              Pass: 100%/4   | Total:  3h 31m | Avg: 52m 51s | Max: 54m 42s
  🟩 GCC11              Pass: 100%/7   | Total:  6h 49m | Avg: 58m 28s | Max:  1h 07m
  🟩 GCC12              Pass: 100%/4   | Total:  3h 35m | Avg: 53m 51s | Max: 55m 59s
  🟩 GCC13              Pass: 100%/16  | Total:  9h 03m | Avg: 33m 57s | Max: 54m 33s
  🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 42m | Avg: 54m 11s | Max: 55m 44s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 49m 44s | Avg: 49m 44s | Max: 49m 44s | Hits:  64%/737   
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 55m | Avg: 57m 30s | Max: 58m 42s | Hits:  64%/1474  
  🟩 MSVC14.39          Pass: 100%/1   | Total: 59m 53s | Avg: 59m 53s | Max: 59m 53s | Hits:  64%/737   
  🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 02m | Avg:  1h 00m | Max:  1h 05m
🟩 cxx_family
  🟩 Clang              Pass: 100%/48  | Total:  1d 16h | Avg: 50m 48s | Max: 57m 14s
  🟩 GCC                Pass: 100%/51  | Total:  1d 15h | Avg: 45m 53s | Max:  1h 07m
  🟩 Intel              Pass: 100%/3   | Total:  2h 42m | Avg: 54m 11s | Max: 55m 44s
  🟩 MSVC               Pass: 100%/4   | Total:  3h 44m | Avg: 56m 09s | Max: 59m 53s | Hits:  64%/2948  
  🟩 NVHPC              Pass: 100%/4   | Total:  4h 02m | Avg:  1h 00m | Max:  1h 05m
🟩 gpu
  🟩 v100               Pass: 100%/110 | Total:  3d 18h | Avg: 49m 10s | Max:  1h 07m | Hits:  64%/2948  
🟩 jobs
  🟩 Build              Pass: 100%/102 | Total:  3d 15h | Avg: 51m 18s | Max:  1h 07m | Hits:  64%/2948  
  🟩 DeviceLaunch       Pass: 100%/1   | Total: 29m 38s | Avg: 29m 38s | Max: 29m 38s
  🟩 GraphCapture       Pass: 100%/1   | Total: 18m 41s | Avg: 18m 41s | Max: 18m 41s
  🟩 HostLaunch         Pass: 100%/3   | Total: 59m 37s | Avg: 19m 52s | Max: 21m 24s
  🟩 TestGPU            Pass: 100%/3   | Total:  1h 08m | Avg: 22m 44s | Max: 26m 57s
🟩 sm
  🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 16m | Avg:  1h 05m | Max:  1h 07m
  🟩 90a                Pass: 100%/4   | Total:  1h 27m | Avg: 21m 48s | Max: 22m 38s
🟩 std
  🟩 11                 Pass: 100%/30  | Total:  1d 00h | Avg: 48m 23s | Max:  1h 05m
  🟩 14                 Pass: 100%/29  | Total:  1d 00h | Avg: 51m 04s | Max:  1h 07m | Hits:  64%/1474  
  🟩 17                 Pass: 100%/27  | Total: 23h 00m | Avg: 51m 08s | Max:  1h 05m | Hits:  64%/737   
  🟩 20                 Pass: 100%/24  | Total: 18h 15m | Avg: 45m 39s | Max:  1h 01m | Hits:  64%/737

🟩 thrust: Pass: 100%/109 | Total: 2d 05h | Avg: 29m 18s | Max: 57m 55s | Hits: 83%/13165

🟩 cpu
  🟩 amd64              Pass: 100%/101 | Total:  2d 01h | Avg: 29m 23s | Max: 57m 55s | Hits:  83%/13165 
  🟩 arm64              Pass: 100%/8   | Total:  3h 46m | Avg: 28m 17s | Max: 31m 07s
🟩 ctk
  🟩 11.1               Pass: 100%/15  | Total:  7h 13m | Avg: 28m 55s | Max: 57m 55s | Hits:  79%/2633  
  🟩 11.8               Pass: 100%/3   | Total:  1h 50m | Avg: 36m 50s | Max: 38m 46s
  🟩 12.5               Pass: 100%/4   | Total:  3h 23m | Avg: 50m 49s | Max: 57m 21s
  🟩 12.6               Pass: 100%/87  | Total:  1d 16h | Avg: 28m 07s | Max: 54m 01s | Hits:  84%/10532 
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/4   | Total:  1h 33m | Avg: 23m 20s | Max: 26m 23s
  🟩 nvcc11.1           Pass: 100%/15  | Total:  7h 13m | Avg: 28m 55s | Max: 57m 55s | Hits:  79%/2633  
  🟩 nvcc11.8           Pass: 100%/3   | Total:  1h 50m | Avg: 36m 50s | Max: 38m 46s
  🟩 nvcc12.5           Pass: 100%/4   | Total:  3h 23m | Avg: 50m 49s | Max: 57m 21s
  🟩 nvcc12.6           Pass: 100%/83  | Total:  1d 15h | Avg: 28m 21s | Max: 54m 01s | Hits:  84%/10532 
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/4   | Total:  1h 33m | Avg: 23m 20s | Max: 26m 23s
  🟩 nvcc               Pass: 100%/105 | Total:  2d 03h | Avg: 29m 31s | Max: 57m 55s | Hits:  83%/13165 
🟩 cxx
  🟩 Clang9             Pass: 100%/6   | Total:  2h 45m | Avg: 27m 35s | Max: 34m 59s
  🟩 Clang10            Pass: 100%/3   | Total:  1h 29m | Avg: 29m 49s | Max: 32m 52s
  🟩 Clang11            Pass: 100%/4   | Total:  1h 57m | Avg: 29m 26s | Max: 32m 57s
  🟩 Clang12            Pass: 100%/4   | Total:  1h 50m | Avg: 27m 34s | Max: 29m 38s
  🟩 Clang13            Pass: 100%/4   | Total:  1h 59m | Avg: 29m 54s | Max: 33m 29s
  🟩 Clang14            Pass: 100%/4   | Total:  1h 53m | Avg: 28m 19s | Max: 29m 10s
  🟩 Clang15            Pass: 100%/4   | Total:  1h 53m | Avg: 28m 21s | Max: 31m 05s
  🟩 Clang16            Pass: 100%/4   | Total:  2h 05m | Avg: 31m 17s | Max: 34m 15s
  🟩 Clang17            Pass: 100%/4   | Total:  2h 03m | Avg: 30m 55s | Max: 35m 19s
  🟩 Clang18            Pass: 100%/11  | Total:  4h 07m | Avg: 22m 31s | Max: 29m 59s
  🟩 GCC6               Pass: 100%/2   | Total: 49m 32s | Avg: 24m 46s | Max: 27m 58s
  🟩 GCC7               Pass: 100%/6   | Total:  2h 44m | Avg: 27m 28s | Max: 31m 15s
  🟩 GCC8               Pass: 100%/6   | Total:  2h 44m | Avg: 27m 22s | Max: 30m 55s
  🟩 GCC9               Pass: 100%/6   | Total:  2h 58m | Avg: 29m 47s | Max: 32m 37s
  🟩 GCC10              Pass: 100%/4   | Total:  1h 59m | Avg: 29m 56s | Max: 32m 46s
  🟩 GCC11              Pass: 100%/7   | Total:  3h 52m | Avg: 33m 12s | Max: 38m 46s
  🟩 GCC12              Pass: 100%/4   | Total:  1h 58m | Avg: 29m 42s | Max: 31m 51s
  🟩 GCC13              Pass: 100%/14  | Total:  4h 51m | Avg: 20m 48s | Max: 32m 10s
  🟩 Intel2023.2.0      Pass: 100%/3   | Total:  1h 50m | Avg: 36m 49s | Max: 42m 41s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 57m 55s | Avg: 57m 55s | Max: 57m 55s | Hits:  79%/2633  
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 43m | Avg: 51m 56s | Max: 54m 01s | Hits:  79%/5266  
  🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 12m | Avg: 36m 25s | Max: 52m 45s | Hits:  89%/5266  
  🟩 NVHPC24.7          Pass: 100%/4   | Total:  3h 23m | Avg: 50m 49s | Max: 57m 21s
🟩 cxx_family
  🟩 Clang              Pass: 100%/48  | Total: 22h 06m | Avg: 27m 37s | Max: 35m 19s
  🟩 GCC                Pass: 100%/49  | Total: 21h 59m | Avg: 26m 56s | Max: 38m 46s
  🟩 Intel              Pass: 100%/3   | Total:  1h 50m | Avg: 36m 49s | Max: 42m 41s
  🟩 MSVC               Pass: 100%/5   | Total:  3h 54m | Avg: 46m 55s | Max: 57m 55s | Hits:  83%/13165 
  🟩 NVHPC              Pass: 100%/4   | Total:  3h 23m | Avg: 50m 49s | Max: 57m 21s
🟩 gpu
  🟩 v100               Pass: 100%/109 | Total:  2d 05h | Avg: 29m 18s | Max: 57m 55s | Hits:  83%/13165 
🟩 jobs
  🟩 Build              Pass: 100%/102 | Total:  2d 03h | Avg: 30m 27s | Max: 57m 55s | Hits:  79%/10532 
  🟩 TestCPU            Pass: 100%/4   | Total: 41m 21s | Avg: 10m 20s | Max: 20m 06s | Hits:  99%/2633  
  🟩 TestGPU            Pass: 100%/3   | Total: 46m 03s | Avg: 15m 21s | Max: 18m 25s
🟩 sm
  🟩 60;70;80;90        Pass: 100%/3   | Total:  1h 50m | Avg: 36m 50s | Max: 38m 46s
  🟩 90a                Pass: 100%/4   | Total:  1h 04m | Avg: 16m 13s | Max: 17m 29s
🟩 std
  🟩 11                 Pass: 100%/30  | Total: 12h 06m | Avg: 24m 12s | Max: 40m 34s
  🟩 14                 Pass: 100%/29  | Total: 15h 40m | Avg: 32m 26s | Max: 57m 55s | Hits:  79%/5266  
  🟩 17                 Pass: 100%/27  | Total: 14h 33m | Avg: 32m 22s | Max: 55m 04s | Hits:  79%/2633  
  🟩 20                 Pass: 100%/23  | Total: 10h 52m | Avg: 28m 23s | Max: 57m 21s | Hits:  89%/5266

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 32s | Avg: 4m 46s | Max: 7m 09s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total:  9m 32s | Avg:  4m 46s | Max:  7m 09s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total:  9m 32s | Avg:  4m 46s | Max:  7m 09s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 32s | Avg:  4m 46s | Max:  7m 09s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total:  9m 32s | Avg:  4m 46s | Max:  7m 09s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total:  9m 32s | Avg:  4m 46s | Max:  7m 09s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total:  9m 32s | Avg:  4m 46s | Max:  7m 09s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total:  9m 32s | Avg:  4m 46s | Max:  7m 09s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 23s | Avg:  2m 23s | Max:  2m 23s
  🟩 Test               Pass: 100%/1   | Total:  7m 09s | Avg:  7m 09s | Max:  7m 09s

🟩 pycuda: Pass: 100%/1 | Total: 15m 49s | Avg: 15m 49s | Max: 15m 49s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 15m 49s | Avg: 15m 49s | Max: 15m 49s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 15m 49s | Avg: 15m 49s | Max: 15m 49s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 15m 49s | Avg: 15m 49s | Max: 15m 49s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 15m 49s | Avg: 15m 49s | Max: 15m 49s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 15m 49s | Avg: 15m 49s | Max: 15m 49s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 15m 49s | Avg: 15m 49s | Max: 15m 49s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 15m 49s | Avg: 15m 49s | Max: 15m 49s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 15m 49s | Avg: 15m 49s | Max: 15m 49s

github-actions · 2024-11-01T12:33:07Z

🟩 CI finished in 1h 11m: Pass: 100%/222 | Total: 3d 11h | Avg: 22m 40s | Max: 1h 11m | Hits: 91%/16109

🟩 cub: Pass: 100%/110 | Total: 2d 18h | Avg: 36m 06s | Max: 1h 11m | Hits: 89%/2944

🟩 cpu
  🟩 amd64              Pass: 100%/102 | Total:  2d 12h | Avg: 35m 39s | Max:  1h 11m | Hits:  89%/2944  
  🟩 arm64              Pass: 100%/8   | Total:  5h 34m | Avg: 41m 45s | Max: 42m 53s
🟩 ctk
  🟩 11.1               Pass: 100%/15  | Total:  8h 12m | Avg: 32m 49s | Max: 45m 10s | Hits:  88%/736   
  🟩 11.8               Pass: 100%/3   | Total:  3h 13m | Avg:  1h 04m | Max:  1h 11m
  🟩 12.5               Pass: 100%/4   | Total:  2h 52m | Avg: 43m 14s | Max: 45m 12s
  🟩 12.6               Pass: 100%/88  | Total:  2d 03h | Avg: 35m 22s | Max: 54m 57s | Hits:  89%/2208  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/4   | Total:  3h 28m | Avg: 52m 03s | Max: 54m 57s
  🟩 nvcc11.1           Pass: 100%/15  | Total:  8h 12m | Avg: 32m 49s | Max: 45m 10s | Hits:  88%/736   
  🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 13m | Avg:  1h 04m | Max:  1h 11m
  🟩 nvcc12.5           Pass: 100%/4   | Total:  2h 52m | Avg: 43m 14s | Max: 45m 12s
  🟩 nvcc12.6           Pass: 100%/84  | Total:  2d 00h | Avg: 34m 35s | Max: 49m 43s | Hits:  89%/2208  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/4   | Total:  3h 28m | Avg: 52m 03s | Max: 54m 57s
  🟩 nvcc               Pass: 100%/106 | Total:  2d 14h | Avg: 35m 30s | Max:  1h 11m | Hits:  89%/2944  
🟩 cxx
  🟩 Clang9             Pass: 100%/6   | Total:  3h 25m | Avg: 34m 16s | Max: 40m 05s
  🟩 Clang10            Pass: 100%/3   | Total:  1h 56m | Avg: 38m 43s | Max: 40m 32s
  🟩 Clang11            Pass: 100%/4   | Total:  2h 32m | Avg: 38m 06s | Max: 39m 43s
  🟩 Clang12            Pass: 100%/4   | Total:  2h 21m | Avg: 35m 21s | Max: 36m 37s
  🟩 Clang13            Pass: 100%/4   | Total:  2h 22m | Avg: 35m 35s | Max: 37m 26s
  🟩 Clang14            Pass: 100%/4   | Total:  2h 25m | Avg: 36m 27s | Max: 38m 48s
  🟩 Clang15            Pass: 100%/4   | Total:  2h 31m | Avg: 37m 48s | Max: 39m 48s
  🟩 Clang16            Pass: 100%/4   | Total:  2h 26m | Avg: 36m 30s | Max: 39m 15s
  🟩 Clang17            Pass: 100%/4   | Total:  2h 22m | Avg: 35m 36s | Max: 37m 51s
  🟩 Clang18            Pass: 100%/11  | Total:  7h 27m | Avg: 40m 41s | Max: 54m 57s
  🟩 GCC6               Pass: 100%/2   | Total:  1h 05m | Avg: 32m 42s | Max: 33m 21s
  🟩 GCC7               Pass: 100%/6   | Total:  3h 21m | Avg: 33m 30s | Max: 38m 34s
  🟩 GCC8               Pass: 100%/6   | Total:  3h 34m | Avg: 35m 49s | Max: 45m 10s
  🟩 GCC9               Pass: 100%/6   | Total:  3h 20m | Avg: 33m 20s | Max: 36m 36s
  🟩 GCC10              Pass: 100%/4   | Total:  2h 27m | Avg: 36m 51s | Max: 38m 39s
  🟩 GCC11              Pass: 100%/7   | Total:  5h 40m | Avg: 48m 37s | Max:  1h 11m
  🟩 GCC12              Pass: 100%/4   | Total:  2h 34m | Avg: 38m 41s | Max: 39m 51s
  🟩 GCC13              Pass: 100%/16  | Total:  6h 23m | Avg: 23m 56s | Max: 42m 53s
  🟩 Intel2023.2.0      Pass: 100%/3   | Total:  1h 54m | Avg: 38m 06s | Max: 40m 48s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 43m 49s | Avg: 43m 49s | Max: 43m 49s | Hits:  88%/736   
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 32m | Avg: 46m 12s | Max: 48m 34s | Hits:  90%/1472  
  🟩 MSVC14.39          Pass: 100%/1   | Total: 49m 43s | Avg: 49m 43s | Max: 49m 43s | Hits:  88%/736   
  🟩 NVHPC24.7          Pass: 100%/4   | Total:  2h 52m | Avg: 43m 14s | Max: 45m 12s
🟩 cxx_family
  🟩 Clang              Pass: 100%/48  | Total:  1d 05h | Avg: 37m 19s | Max: 54m 57s
  🟩 GCC                Pass: 100%/51  | Total:  1d 04h | Avg: 33m 28s | Max:  1h 11m
  🟩 Intel              Pass: 100%/3   | Total:  1h 54m | Avg: 38m 06s | Max: 40m 48s
  🟩 MSVC               Pass: 100%/4   | Total:  3h 05m | Avg: 46m 29s | Max: 49m 43s | Hits:  89%/2944  
  🟩 NVHPC              Pass: 100%/4   | Total:  2h 52m | Avg: 43m 14s | Max: 45m 12s
🟩 gpu
  🟩 v100               Pass: 100%/110 | Total:  2d 18h | Avg: 36m 06s | Max:  1h 11m | Hits:  89%/2944  
🟩 jobs
  🟩 Build              Pass: 100%/102 | Total:  2d 15h | Avg: 37m 26s | Max:  1h 11m | Hits:  89%/2944  
  🟩 DeviceLaunch       Pass: 100%/1   | Total: 17m 01s | Avg: 17m 01s | Max: 17m 01s
  🟩 GraphCapture       Pass: 100%/1   | Total: 14m 40s | Avg: 14m 40s | Max: 14m 40s
  🟩 HostLaunch         Pass: 100%/3   | Total: 57m 09s | Avg: 19m 03s | Max: 20m 28s
  🟩 TestGPU            Pass: 100%/3   | Total:  1h 03m | Avg: 21m 02s | Max: 22m 25s
🟩 sm
  🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 13m | Avg:  1h 04m | Max:  1h 11m
  🟩 90a                Pass: 100%/4   | Total: 27m 45s | Avg:  6m 56s | Max: 13m 14s
🟩 std
  🟩 11                 Pass: 100%/30  | Total: 17h 49m | Avg: 35m 39s | Max: 54m 36s
  🟩 14                 Pass: 100%/29  | Total: 18h 05m | Avg: 37m 25s | Max:  1h 11m | Hits:  89%/1472  
  🟩 17                 Pass: 100%/27  | Total: 17h 14m | Avg: 38m 18s | Max:  1h 10m | Hits:  90%/736   
  🟩 20                 Pass: 100%/24  | Total: 13h 02m | Avg: 32m 35s | Max: 50m 03s | Hits:  88%/736

🟩 thrust: Pass: 100%/109 | Total: 17h 20m | Avg: 9m 32s | Max: 41m 32s | Hits: 92%/13165

🟩 cpu
  🟩 amd64              Pass: 100%/101 | Total: 16h 40m | Avg:  9m 54s | Max: 41m 32s | Hits:  92%/13165 
  🟩 arm64              Pass: 100%/8   | Total: 40m 53s | Avg:  5m 06s | Max:  5m 46s
🟩 ctk
  🟩 11.1               Pass: 100%/15  | Total:  2h 09m | Avg:  8m 37s | Max: 38m 05s | Hits:  90%/2633  
  🟩 11.8               Pass: 100%/3   | Total: 37m 38s | Avg: 12m 32s | Max: 16m 36s
  🟩 12.5               Pass: 100%/4   | Total:  2h 15m | Avg: 33m 48s | Max: 41m 32s
  🟩 12.6               Pass: 100%/87  | Total: 12h 18m | Avg:  8m 29s | Max: 38m 36s | Hits:  92%/10532 
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/4   | Total: 19m 53s | Avg:  4m 58s | Max:  5m 26s
  🟩 nvcc11.1           Pass: 100%/15  | Total:  2h 09m | Avg:  8m 37s | Max: 38m 05s | Hits:  90%/2633  
  🟩 nvcc11.8           Pass: 100%/3   | Total: 37m 38s | Avg: 12m 32s | Max: 16m 36s
  🟩 nvcc12.5           Pass: 100%/4   | Total:  2h 15m | Avg: 33m 48s | Max: 41m 32s
  🟩 nvcc12.6           Pass: 100%/83  | Total: 11h 58m | Avg:  8m 39s | Max: 38m 36s | Hits:  92%/10532 
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/4   | Total: 19m 53s | Avg:  4m 58s | Max:  5m 26s
  🟩 nvcc               Pass: 100%/105 | Total: 17h 01m | Avg:  9m 43s | Max: 41m 32s | Hits:  92%/13165 
🟩 cxx
  🟩 Clang9             Pass: 100%/6   | Total: 41m 30s | Avg:  6m 55s | Max: 11m 31s
  🟩 Clang10            Pass: 100%/3   | Total: 24m 47s | Avg:  8m 15s | Max: 12m 02s
  🟩 Clang11            Pass: 100%/4   | Total: 31m 18s | Avg:  7m 49s | Max: 10m 44s
  🟩 Clang12            Pass: 100%/4   | Total: 21m 35s | Avg:  5m 23s | Max:  5m 41s
  🟩 Clang13            Pass: 100%/4   | Total: 30m 08s | Avg:  7m 32s | Max: 10m 06s
  🟩 Clang14            Pass: 100%/4   | Total: 28m 36s | Avg:  7m 09s | Max:  9m 17s
  🟩 Clang15            Pass: 100%/4   | Total: 32m 18s | Avg:  8m 04s | Max: 10m 47s
  🟩 Clang16            Pass: 100%/4   | Total: 35m 13s | Avg:  8m 48s | Max: 10m 22s
  🟩 Clang17            Pass: 100%/4   | Total: 25m 35s | Avg:  6m 23s | Max:  9m 44s
  🟩 Clang18            Pass: 100%/11  | Total:  1h 09m | Avg:  6m 16s | Max: 11m 55s
  🟩 GCC6               Pass: 100%/2   | Total:  8m 02s | Avg:  4m 01s | Max:  4m 09s
  🟩 GCC7               Pass: 100%/6   | Total:  1h 03m | Avg: 10m 35s | Max: 33m 40s
  🟩 GCC8               Pass: 100%/6   | Total: 28m 16s | Avg:  4m 42s | Max:  5m 15s
  🟩 GCC9               Pass: 100%/6   | Total: 30m 36s | Avg:  5m 06s | Max:  6m 01s
  🟩 GCC10              Pass: 100%/4   | Total: 31m 53s | Avg:  7m 58s | Max: 11m 15s
  🟩 GCC11              Pass: 100%/7   | Total:  1h 00m | Avg:  8m 36s | Max: 16m 36s
  🟩 GCC12              Pass: 100%/4   | Total: 32m 50s | Avg:  8m 12s | Max: 11m 14s
  🟩 GCC13              Pass: 100%/14  | Total:  1h 43m | Avg:  7m 22s | Max: 19m 48s
  🟩 Intel2023.2.0      Pass: 100%/3   | Total: 38m 16s | Avg: 12m 45s | Max: 16m 37s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 38m 05s | Avg: 38m 05s | Max: 38m 05s | Hits:  90%/2633  
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 12m | Avg: 36m 24s | Max: 38m 36s | Hits:  89%/5266  
  🟩 MSVC14.39          Pass: 100%/2   | Total: 57m 50s | Avg: 28m 55s | Max: 35m 06s | Hits:  94%/5266  
  🟩 NVHPC24.7          Pass: 100%/4   | Total:  2h 15m | Avg: 33m 48s | Max: 41m 32s
🟩 cxx_family
  🟩 Clang              Pass: 100%/48  | Total:  5h 40m | Avg:  7m 05s | Max: 12m 02s
  🟩 GCC                Pass: 100%/49  | Total:  5h 58m | Avg:  7m 19s | Max: 33m 40s
  🟩 Intel              Pass: 100%/3   | Total: 38m 16s | Avg: 12m 45s | Max: 16m 37s
  🟩 MSVC               Pass: 100%/5   | Total:  2h 48m | Avg: 33m 44s | Max: 38m 36s | Hits:  92%/13165 
  🟩 NVHPC              Pass: 100%/4   | Total:  2h 15m | Avg: 33m 48s | Max: 41m 32s
🟩 gpu
  🟩 v100               Pass: 100%/109 | Total: 17h 20m | Avg:  9m 32s | Max: 41m 32s | Hits:  92%/13165 
🟩 jobs
  🟩 Build              Pass: 100%/102 | Total: 15h 51m | Avg:  9m 19s | Max: 41m 32s | Hits:  90%/10532 
  🟩 TestCPU            Pass: 100%/4   | Total: 46m 43s | Avg: 11m 40s | Max: 22m 44s | Hits:  99%/2633  
  🟩 TestGPU            Pass: 100%/3   | Total: 42m 27s | Avg: 14m 09s | Max: 19m 48s
🟩 sm
  🟩 60;70;80;90        Pass: 100%/3   | Total: 37m 38s | Avg: 12m 32s | Max: 16m 36s
  🟩 90a                Pass: 100%/4   | Total: 18m 01s | Avg:  4m 30s | Max:  4m 52s
🟩 std
  🟩 11                 Pass: 100%/30  | Total:  2h 59m | Avg:  5m 58s | Max: 26m 01s
  🟩 14                 Pass: 100%/29  | Total:  5h 20m | Avg: 11m 03s | Max: 38m 36s | Hits:  90%/5266  
  🟩 17                 Pass: 100%/27  | Total:  4h 34m | Avg: 10m 10s | Max: 41m 32s | Hits:  90%/2633  
  🟩 20                 Pass: 100%/23  | Total:  4h 26m | Avg: 11m 34s | Max: 35m 06s | Hits:  94%/5266

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 48s | Avg: 4m 24s | Max: 6m 38s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total:  8m 48s | Avg:  4m 24s | Max:  6m 38s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total:  8m 48s | Avg:  4m 24s | Max:  6m 38s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 48s | Avg:  4m 24s | Max:  6m 38s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total:  8m 48s | Avg:  4m 24s | Max:  6m 38s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total:  8m 48s | Avg:  4m 24s | Max:  6m 38s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total:  8m 48s | Avg:  4m 24s | Max:  6m 38s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total:  8m 48s | Avg:  4m 24s | Max:  6m 38s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 10s | Avg:  2m 10s | Max:  2m 10s
  🟩 Test               Pass: 100%/1   | Total:  6m 38s | Avg:  6m 38s | Max:  6m 38s

🟩 python: Pass: 100%/1 | Total: 13m 41s | Avg: 13m 41s | Max: 13m 41s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 13m 41s | Avg: 13m 41s | Max: 13m 41s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 13m 41s | Avg: 13m 41s | Max: 13m 41s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 13m 41s | Avg: 13m 41s | Max: 13m 41s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 13m 41s | Avg: 13m 41s | Max: 13m 41s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 13m 41s | Avg: 13m 41s | Max: 13m 41s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 13m 41s | Avg: 13m 41s | Max: 13m 41s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 13m 41s | Avg: 13m 41s | Max: 13m 41s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 13m 41s | Avg: 13m 41s | Max: 13m 41s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
	Thrust
	CUDA Experimental
	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
+/-	python
+/-	CCCL C Parallel Library
+/-	Catch2Helper

🏃‍ Runner counts (total jobs: 222)

#	Runner
184	`linux-amd64-cpu16`
16	`linux-arm64-cpu16`
13	`linux-amd64-gpu-v100-latest-1`
9	`windows-amd64-cpu16`

…-reduce-argminmax

github-actions · 2024-12-03T17:45:09Z

🟨 CI finished in 3h 24m: Pass: 99%/224 | Total: 4d 11h | Avg: 28m 54s | Max: 1h 17m | Hits: 99%/12308

🟨 cub: Pass: 99%/110 | Total: 3d 22h | Avg: 51m 25s | Max: 1h 17m | Hits: 98%/3048

🔍 cpu: amd64 🔍
  🔍 amd64              Pass:  99%/102 | Total:  3d 14h | Avg: 51m 07s | Max:  1h 17m | Hits:  98%/3048  
  🟩 arm64              Pass: 100%/8   | Total:  7h 21m | Avg: 55m 09s | Max:  1h 02m
🔍 ctk: 12.6 🔍
  🟩 11.1               Pass: 100%/15  | Total: 11h 00m | Avg: 44m 01s | Max: 47m 47s | Hits:  98%/762   
  🟩 11.8               Pass: 100%/3   | Total:  3h 19m | Avg:  1h 06m | Max:  1h 09m
  🟩 12.5               Pass: 100%/4   | Total:  4h 02m | Avg:  1h 00m | Max:  1h 03m
  🔍 12.6               Pass:  98%/88  | Total:  3d 03h | Avg: 51m 45s | Max:  1h 17m | Hits:  98%/2286  
🔍 cudacxx: nvcc12.6 🔍
  🟩 ClangCUDA18        Pass: 100%/4   | Total:  3h 36m | Avg: 54m 08s | Max: 55m 53s
  🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 00m | Avg: 44m 01s | Max: 47m 47s | Hits:  98%/762   
  🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 19m | Avg:  1h 06m | Max:  1h 09m
  🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 02m | Avg:  1h 00m | Max:  1h 03m
  🔍 nvcc12.6           Pass:  98%/84  | Total:  3d 00h | Avg: 51m 38s | Max:  1h 17m | Hits:  98%/2286  
🔍 cudacxx_family: nvcc 🔍
  🟩 ClangCUDA          Pass: 100%/4   | Total:  3h 36m | Avg: 54m 08s | Max: 55m 53s
  🔍 nvcc               Pass:  99%/106 | Total:  3d 18h | Avg: 51m 19s | Max:  1h 17m | Hits:  98%/3048  
🔍 cxx: GCC9 🔍
  🟩 Clang9             Pass: 100%/6   | Total:  4h 50m | Avg: 48m 25s | Max: 53m 28s
  🟩 Clang10            Pass: 100%/3   | Total:  2h 34m | Avg: 51m 32s | Max: 53m 35s
  🟩 Clang11            Pass: 100%/4   | Total:  3h 26m | Avg: 51m 30s | Max: 55m 17s
  🟩 Clang12            Pass: 100%/4   | Total:  3h 29m | Avg: 52m 15s | Max: 53m 39s
  🟩 Clang13            Pass: 100%/4   | Total:  3h 19m | Avg: 49m 48s | Max: 51m 21s
  🟩 Clang14            Pass: 100%/4   | Total:  3h 31m | Avg: 52m 51s | Max: 55m 24s
  🟩 Clang15            Pass: 100%/4   | Total:  3h 21m | Avg: 50m 15s | Max: 53m 21s
  🟩 Clang16            Pass: 100%/4   | Total:  3h 32m | Avg: 53m 14s | Max: 54m 27s
  🟩 Clang17            Pass: 100%/4   | Total:  3h 31m | Avg: 52m 53s | Max: 55m 57s
  🟩 Clang18            Pass: 100%/11  | Total: 10h 37m | Avg: 57m 59s | Max:  1h 15m
  🟩 GCC6               Pass: 100%/2   | Total:  1h 33m | Avg: 46m 57s | Max: 47m 47s
  🟩 GCC7               Pass: 100%/6   | Total:  4h 45m | Avg: 47m 33s | Max: 54m 06s
  🟩 GCC8               Pass: 100%/6   | Total:  4h 43m | Avg: 47m 16s | Max: 52m 15s
  🔍 GCC9               Pass:  83%/6   | Total:  4h 49m | Avg: 48m 14s | Max: 54m 34s
  🟩 GCC10              Pass: 100%/4   | Total:  3h 32m | Avg: 53m 08s | Max: 55m 51s
  🟩 GCC11              Pass: 100%/7   | Total:  6h 46m | Avg: 58m 07s | Max:  1h 09m
  🟩 GCC12              Pass: 100%/4   | Total:  3h 33m | Avg: 53m 26s | Max:  1h 00m
  🟩 GCC13              Pass: 100%/16  | Total: 14h 14m | Avg: 53m 25s | Max:  1h 17m
  🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 40m | Avg: 53m 22s | Max: 56m 01s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 25m 07s | Avg: 25m 07s | Max: 25m 07s | Hits:  98%/762   
  🟩 MSVC14.29          Pass: 100%/2   | Total: 34m 15s | Avg: 17m 07s | Max: 18m 41s | Hits:  98%/1524  
  🟩 MSVC14.39          Pass: 100%/1   | Total: 19m 52s | Avg: 19m 52s | Max: 19m 52s | Hits:  98%/762   
  🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 02m | Avg:  1h 00m | Max:  1h 03m
🔍 cxx_family: GCC 🔍
  🟩 Clang              Pass: 100%/48  | Total:  1d 18h | Avg: 52m 47s | Max:  1h 15m
  🔍 GCC                Pass:  98%/51  | Total:  1d 20h | Avg: 51m 46s | Max:  1h 17m
  🟩 Intel              Pass: 100%/3   | Total:  2h 40m | Avg: 53m 22s | Max: 56m 01s
  🟩 MSVC               Pass: 100%/4   | Total:  1h 19m | Avg: 19m 48s | Max: 25m 07s | Hits:  98%/3048  
  🟩 NVHPC              Pass: 100%/4   | Total:  4h 02m | Avg:  1h 00m | Max:  1h 03m
🔍 jobs: Build 🔍
  🔍 Build              Pass:  99%/102 | Total:  3d 12h | Avg: 49m 44s | Max:  1h 09m | Hits:  98%/3048  
  🟩 DeviceLaunch       Pass: 100%/1   | Total:  1h 05m | Avg:  1h 05m | Max:  1h 05m
  🟩 GraphCapture       Pass: 100%/1   | Total:  1h 07m | Avg:  1h 07m | Max:  1h 07m
  🟩 HostLaunch         Pass: 100%/3   | Total:  3h 40m | Avg:  1h 13m | Max:  1h 15m
  🟩 TestGPU            Pass: 100%/3   | Total:  3h 49m | Avg:  1h 16m | Max:  1h 17m
🔍 std: 14 🔍
  🟩 11                 Pass: 100%/30  | Total:  1d 02h | Avg: 52m 16s | Max:  1h 15m
  🔍 14                 Pass:  96%/29  | Total: 23h 11m | Avg: 47m 58s | Max:  1h 05m | Hits:  98%/1524  
  🟩 17                 Pass: 100%/27  | Total: 22h 17m | Avg: 49m 31s | Max:  1h 04m | Hits:  98%/762   
  🟩 20                 Pass: 100%/24  | Total: 22h 39m | Avg: 56m 39s | Max:  1h 17m | Hits:  98%/762   
🟨 gpu
  🟨 v100               Pass:  99%/110 | Total:  3d 22h | Avg: 51m 25s | Max:  1h 17m | Hits:  98%/3048  
🟩 sm
  🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 19m | Avg:  1h 06m | Max:  1h 09m
  🟩 90a                Pass: 100%/4   | Total:  1h 31m | Avg: 22m 57s | Max: 23m 52s

🟩 thrust: Pass: 100%/111 | Total: 13h 15m | Avg: 7m 09s | Max: 27m 42s | Hits: 99%/9260

🟩 cmake_options
  🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 26m 37s | Avg: 13m 18s | Max: 20m 24s
🟩 cpu
  🟩 amd64              Pass: 100%/103 | Total: 12h 34m | Avg:  7m 19s | Max: 27m 42s | Hits:  99%/9260  
  🟩 arm64              Pass: 100%/8   | Total: 41m 14s | Avg:  5m 09s | Max:  6m 07s
🟩 ctk
  🟩 11.1               Pass: 100%/15  | Total:  1h 28m | Avg:  5m 53s | Max: 21m 07s | Hits:  99%/1852  
  🟩 11.8               Pass: 100%/3   | Total: 17m 13s | Avg:  5m 44s | Max:  5m 59s
  🟩 12.5               Pass: 100%/4   | Total: 55m 52s | Avg: 13m 58s | Max: 14m 26s
  🟩 12.6               Pass: 100%/89  | Total: 10h 33m | Avg:  7m 07s | Max: 27m 42s | Hits:  99%/7408  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/4   | Total: 21m 28s | Avg:  5m 22s | Max:  5m 59s
  🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 28m | Avg:  5m 53s | Max: 21m 07s | Hits:  99%/1852  
  🟩 nvcc11.8           Pass: 100%/3   | Total: 17m 13s | Avg:  5m 44s | Max:  5m 59s
  🟩 nvcc12.5           Pass: 100%/4   | Total: 55m 52s | Avg: 13m 58s | Max: 14m 26s
  🟩 nvcc12.6           Pass: 100%/85  | Total: 10h 12m | Avg:  7m 12s | Max: 27m 42s | Hits:  99%/7408  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/4   | Total: 21m 28s | Avg:  5m 22s | Max:  5m 59s
  🟩 nvcc               Pass: 100%/107 | Total: 12h 53m | Avg:  7m 13s | Max: 27m 42s | Hits:  99%/9260  
🟩 cxx
  🟩 Clang9             Pass: 100%/6   | Total: 34m 14s | Avg:  5m 42s | Max:  6m 52s
  🟩 Clang10            Pass: 100%/3   | Total: 19m 44s | Avg:  6m 34s | Max:  6m 55s
  🟩 Clang11            Pass: 100%/4   | Total: 21m 43s | Avg:  5m 25s | Max:  5m 58s
  🟩 Clang12            Pass: 100%/4   | Total: 22m 18s | Avg:  5m 34s | Max:  5m 55s
  🟩 Clang13            Pass: 100%/4   | Total: 23m 14s | Avg:  5m 48s | Max:  6m 13s
  🟩 Clang14            Pass: 100%/4   | Total: 21m 57s | Avg:  5m 29s | Max:  5m 45s
  🟩 Clang15            Pass: 100%/4   | Total: 21m 57s | Avg:  5m 29s | Max:  5m 53s
  🟩 Clang16            Pass: 100%/4   | Total: 23m 17s | Avg:  5m 49s | Max:  6m 23s
  🟩 Clang17            Pass: 100%/4   | Total: 23m 01s | Avg:  5m 45s | Max:  6m 14s
  🟩 Clang18            Pass: 100%/11  | Total:  1h 07m | Avg:  6m 09s | Max: 12m 26s
  🟩 GCC6               Pass: 100%/2   | Total:  9m 10s | Avg:  4m 35s | Max:  4m 45s
  🟩 GCC7               Pass: 100%/6   | Total: 30m 06s | Avg:  5m 01s | Max:  5m 47s
  🟩 GCC8               Pass: 100%/6   | Total: 53m 59s | Avg:  8m 59s | Max: 27m 42s
  🟩 GCC9               Pass: 100%/6   | Total: 31m 19s | Avg:  5m 13s | Max:  5m 55s
  🟩 GCC10              Pass: 100%/4   | Total: 22m 14s | Avg:  5m 33s | Max:  6m 17s
  🟩 GCC11              Pass: 100%/7   | Total: 41m 53s | Avg:  5m 59s | Max:  6m 50s
  🟩 GCC12              Pass: 100%/4   | Total: 24m 20s | Avg:  6m 05s | Max:  6m 18s
  🟩 GCC13              Pass: 100%/16  | Total:  1h 59m | Avg:  7m 27s | Max: 20m 24s
  🟩 Intel2023.2.0      Pass: 100%/3   | Total: 21m 14s | Avg:  7m 04s | Max:  7m 28s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 21m 07s | Avg: 21m 07s | Max: 21m 07s | Hits:  99%/1852  
  🟩 MSVC14.29          Pass: 100%/2   | Total: 40m 56s | Avg: 20m 28s | Max: 22m 15s | Hits:  99%/3704  
  🟩 MSVC14.39          Pass: 100%/2   | Total: 44m 41s | Avg: 22m 20s | Max: 24m 53s | Hits:  99%/3704  
  🟩 NVHPC24.7          Pass: 100%/4   | Total: 55m 52s | Avg: 13m 58s | Max: 14m 26s
🟩 cxx_family
  🟩 Clang              Pass: 100%/48  | Total:  4h 39m | Avg:  5m 49s | Max: 12m 26s
  🟩 GCC                Pass: 100%/51  | Total:  5h 32m | Avg:  6m 30s | Max: 27m 42s
  🟩 Intel              Pass: 100%/3   | Total: 21m 14s | Avg:  7m 04s | Max:  7m 28s
  🟩 MSVC               Pass: 100%/5   | Total:  1h 46m | Avg: 21m 20s | Max: 24m 53s | Hits:  99%/9260  
  🟩 NVHPC              Pass: 100%/4   | Total: 55m 52s | Avg: 13m 58s | Max: 14m 26s
🟩 gpu
  🟩 v100               Pass: 100%/111 | Total: 13h 15m | Avg:  7m 09s | Max: 27m 42s | Hits:  99%/9260  
🟩 jobs
  🟩 Build              Pass: 100%/103 | Total: 11h 28m | Avg:  6m 41s | Max: 27m 42s | Hits:  99%/7408  
  🟩 TestCPU            Pass: 100%/4   | Total: 48m 32s | Avg: 12m 08s | Max: 24m 53s | Hits:  99%/1852  
  🟩 TestGPU            Pass: 100%/4   | Total: 58m 00s | Avg: 14m 30s | Max: 20m 24s
🟩 sm
  🟩 60;70;80;90        Pass: 100%/3   | Total: 17m 13s | Avg:  5m 44s | Max:  5m 59s
  🟩 90a                Pass: 100%/4   | Total: 18m 26s | Avg:  4m 36s | Max:  4m 58s
🟩 std
  🟩 11                 Pass: 100%/30  | Total:  3h 13m | Avg:  6m 26s | Max: 27m 42s
  🟩 14                 Pass: 100%/29  | Total:  3h 21m | Avg:  6m 56s | Max: 22m 15s | Hits:  99%/3704  
  🟩 17                 Pass: 100%/27  | Total:  2h 58m | Avg:  6m 35s | Max: 18m 41s | Hits:  99%/1852  
  🟩 20                 Pass: 100%/23  | Total:  3h 16m | Avg:  8m 31s | Max: 24m 53s | Hits:  99%/3704

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 14s | Avg: 5m 07s | Max: 7m 44s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 30s | Avg:  2m 30s | Max:  2m 30s
  🟩 Test               Pass: 100%/1   | Total:  7m 44s | Avg:  7m 44s | Max:  7m 44s

🟩 python: Pass: 100%/1 | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s

🏃‍ Runner counts (total jobs: 224)

#	Runner
185	`linux-amd64-cpu16`
16	`linux-arm64-cpu16`
14	`linux-amd64-gpu-v100-latest-1`
9	`windows-amd64-cpu16`

cub/benchmarks/bench/reduce/arg_extrema.cu

cub/cub/device/dispatch/dispatch_streaming_reduce.cuh

github-actions · 2024-12-04T10:33:00Z

🟨 CI finished in 20h 11m: Pass: 99%/224 | Total: 4d 11h | Avg: 28m 51s | Max: 1h 17m | Hits: 99%/12308

🟨 cub: Pass: 99%/110 | Total: 3d 22h | Avg: 51m 19s | Max: 1h 17m | Hits: 98%/3048

🔍 cpu: amd64 🔍
  🔍 amd64              Pass:  99%/102 | Total:  3d 14h | Avg: 51m 01s | Max:  1h 17m | Hits:  98%/3048  
  🟩 arm64              Pass: 100%/8   | Total:  7h 21m | Avg: 55m 09s | Max:  1h 02m
🔍 ctk: 12.6 🔍
  🟩 11.1               Pass: 100%/15  | Total: 11h 00m | Avg: 44m 01s | Max: 47m 47s | Hits:  98%/762   
  🟩 11.8               Pass: 100%/3   | Total:  3h 19m | Avg:  1h 06m | Max:  1h 09m
  🟩 12.5               Pass: 100%/4   | Total:  4h 02m | Avg:  1h 00m | Max:  1h 03m
  🔍 12.6               Pass:  98%/88  | Total:  3d 03h | Avg: 51m 37s | Max:  1h 17m | Hits:  98%/2286  
🔍 cudacxx: nvcc12.6 🔍
  🟩 ClangCUDA18        Pass: 100%/4   | Total:  3h 36m | Avg: 54m 08s | Max: 55m 53s
  🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 00m | Avg: 44m 01s | Max: 47m 47s | Hits:  98%/762   
  🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 19m | Avg:  1h 06m | Max:  1h 09m
  🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 02m | Avg:  1h 00m | Max:  1h 03m
  🔍 nvcc12.6           Pass:  98%/84  | Total:  3d 00h | Avg: 51m 30s | Max:  1h 17m | Hits:  98%/2286  
🔍 cudacxx_family: nvcc 🔍
  🟩 ClangCUDA          Pass: 100%/4   | Total:  3h 36m | Avg: 54m 08s | Max: 55m 53s
  🔍 nvcc               Pass:  99%/106 | Total:  3d 18h | Avg: 51m 13s | Max:  1h 17m | Hits:  98%/3048  
🔍 cxx: GCC9 🔍
  🟩 Clang9             Pass: 100%/6   | Total:  4h 50m | Avg: 48m 25s | Max: 53m 28s
  🟩 Clang10            Pass: 100%/3   | Total:  2h 34m | Avg: 51m 32s | Max: 53m 35s
  🟩 Clang11            Pass: 100%/4   | Total:  3h 26m | Avg: 51m 30s | Max: 55m 17s
  🟩 Clang12            Pass: 100%/4   | Total:  3h 29m | Avg: 52m 15s | Max: 53m 39s
  🟩 Clang13            Pass: 100%/4   | Total:  3h 19m | Avg: 49m 48s | Max: 51m 21s
  🟩 Clang14            Pass: 100%/4   | Total:  3h 31m | Avg: 52m 51s | Max: 55m 24s
  🟩 Clang15            Pass: 100%/4   | Total:  3h 21m | Avg: 50m 15s | Max: 53m 21s
  🟩 Clang16            Pass: 100%/4   | Total:  3h 32m | Avg: 53m 14s | Max: 54m 27s
  🟩 Clang17            Pass: 100%/4   | Total:  3h 31m | Avg: 52m 53s | Max: 55m 57s
  🟩 Clang18            Pass: 100%/11  | Total: 10h 37m | Avg: 57m 59s | Max:  1h 15m
  🟩 GCC6               Pass: 100%/2   | Total:  1h 33m | Avg: 46m 57s | Max: 47m 47s
  🟩 GCC7               Pass: 100%/6   | Total:  4h 45m | Avg: 47m 33s | Max: 54m 06s
  🟩 GCC8               Pass: 100%/6   | Total:  4h 43m | Avg: 47m 16s | Max: 52m 15s
  🔍 GCC9               Pass:  83%/6   | Total:  4h 38m | Avg: 46m 26s | Max: 54m 34s
  🟩 GCC10              Pass: 100%/4   | Total:  3h 32m | Avg: 53m 08s | Max: 55m 51s
  🟩 GCC11              Pass: 100%/7   | Total:  6h 46m | Avg: 58m 07s | Max:  1h 09m
  🟩 GCC12              Pass: 100%/4   | Total:  3h 33m | Avg: 53m 26s | Max:  1h 00m
  🟩 GCC13              Pass: 100%/16  | Total: 14h 14m | Avg: 53m 25s | Max:  1h 17m
  🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 40m | Avg: 53m 22s | Max: 56m 01s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 25m 07s | Avg: 25m 07s | Max: 25m 07s | Hits:  98%/762   
  🟩 MSVC14.29          Pass: 100%/2   | Total: 34m 15s | Avg: 17m 07s | Max: 18m 41s | Hits:  98%/1524  
  🟩 MSVC14.39          Pass: 100%/1   | Total: 19m 52s | Avg: 19m 52s | Max: 19m 52s | Hits:  98%/762   
  🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 02m | Avg:  1h 00m | Max:  1h 03m
🔍 cxx_family: GCC 🔍
  🟩 Clang              Pass: 100%/48  | Total:  1d 18h | Avg: 52m 47s | Max:  1h 15m
  🔍 GCC                Pass:  98%/51  | Total:  1d 19h | Avg: 51m 33s | Max:  1h 17m
  🟩 Intel              Pass: 100%/3   | Total:  2h 40m | Avg: 53m 22s | Max: 56m 01s
  🟩 MSVC               Pass: 100%/4   | Total:  1h 19m | Avg: 19m 48s | Max: 25m 07s | Hits:  98%/3048  
  🟩 NVHPC              Pass: 100%/4   | Total:  4h 02m | Avg:  1h 00m | Max:  1h 03m
🔍 jobs: Build 🔍
  🔍 Build              Pass:  99%/102 | Total:  3d 12h | Avg: 49m 38s | Max:  1h 09m | Hits:  98%/3048  
  🟩 DeviceLaunch       Pass: 100%/1   | Total:  1h 05m | Avg:  1h 05m | Max:  1h 05m
  🟩 GraphCapture       Pass: 100%/1   | Total:  1h 07m | Avg:  1h 07m | Max:  1h 07m
  🟩 HostLaunch         Pass: 100%/3   | Total:  3h 40m | Avg:  1h 13m | Max:  1h 15m
  🟩 TestGPU            Pass: 100%/3   | Total:  3h 49m | Avg:  1h 16m | Max:  1h 17m
🔍 std: 14 🔍
  🟩 11                 Pass: 100%/30  | Total:  1d 02h | Avg: 52m 16s | Max:  1h 15m
  🔍 14                 Pass:  96%/29  | Total: 23h 00m | Avg: 47m 36s | Max:  1h 05m | Hits:  98%/1524  
  🟩 17                 Pass: 100%/27  | Total: 22h 17m | Avg: 49m 31s | Max:  1h 04m | Hits:  98%/762   
  🟩 20                 Pass: 100%/24  | Total: 22h 39m | Avg: 56m 39s | Max:  1h 17m | Hits:  98%/762   
🟨 gpu
  🟨 v100               Pass:  99%/110 | Total:  3d 22h | Avg: 51m 19s | Max:  1h 17m | Hits:  98%/3048  
🟩 sm
  🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 19m | Avg:  1h 06m | Max:  1h 09m
  🟩 90a                Pass: 100%/4   | Total:  1h 31m | Avg: 22m 57s | Max: 23m 52s

🟩 thrust: Pass: 100%/111 | Total: 13h 15m | Avg: 7m 09s | Max: 27m 42s | Hits: 99%/9260

🟩 cmake_options
  🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 26m 37s | Avg: 13m 18s | Max: 20m 24s
🟩 cpu
  🟩 amd64              Pass: 100%/103 | Total: 12h 34m | Avg:  7m 19s | Max: 27m 42s | Hits:  99%/9260  
  🟩 arm64              Pass: 100%/8   | Total: 41m 14s | Avg:  5m 09s | Max:  6m 07s
🟩 ctk
  🟩 11.1               Pass: 100%/15  | Total:  1h 28m | Avg:  5m 53s | Max: 21m 07s | Hits:  99%/1852  
  🟩 11.8               Pass: 100%/3   | Total: 17m 13s | Avg:  5m 44s | Max:  5m 59s
  🟩 12.5               Pass: 100%/4   | Total: 55m 52s | Avg: 13m 58s | Max: 14m 26s
  🟩 12.6               Pass: 100%/89  | Total: 10h 33m | Avg:  7m 07s | Max: 27m 42s | Hits:  99%/7408  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/4   | Total: 21m 28s | Avg:  5m 22s | Max:  5m 59s
  🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 28m | Avg:  5m 53s | Max: 21m 07s | Hits:  99%/1852  
  🟩 nvcc11.8           Pass: 100%/3   | Total: 17m 13s | Avg:  5m 44s | Max:  5m 59s
  🟩 nvcc12.5           Pass: 100%/4   | Total: 55m 52s | Avg: 13m 58s | Max: 14m 26s
  🟩 nvcc12.6           Pass: 100%/85  | Total: 10h 12m | Avg:  7m 12s | Max: 27m 42s | Hits:  99%/7408  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/4   | Total: 21m 28s | Avg:  5m 22s | Max:  5m 59s
  🟩 nvcc               Pass: 100%/107 | Total: 12h 53m | Avg:  7m 13s | Max: 27m 42s | Hits:  99%/9260  
🟩 cxx
  🟩 Clang9             Pass: 100%/6   | Total: 34m 14s | Avg:  5m 42s | Max:  6m 52s
  🟩 Clang10            Pass: 100%/3   | Total: 19m 44s | Avg:  6m 34s | Max:  6m 55s
  🟩 Clang11            Pass: 100%/4   | Total: 21m 43s | Avg:  5m 25s | Max:  5m 58s
  🟩 Clang12            Pass: 100%/4   | Total: 22m 18s | Avg:  5m 34s | Max:  5m 55s
  🟩 Clang13            Pass: 100%/4   | Total: 23m 14s | Avg:  5m 48s | Max:  6m 13s
  🟩 Clang14            Pass: 100%/4   | Total: 21m 57s | Avg:  5m 29s | Max:  5m 45s
  🟩 Clang15            Pass: 100%/4   | Total: 21m 57s | Avg:  5m 29s | Max:  5m 53s
  🟩 Clang16            Pass: 100%/4   | Total: 23m 17s | Avg:  5m 49s | Max:  6m 23s
  🟩 Clang17            Pass: 100%/4   | Total: 23m 01s | Avg:  5m 45s | Max:  6m 14s
  🟩 Clang18            Pass: 100%/11  | Total:  1h 07m | Avg:  6m 09s | Max: 12m 26s
  🟩 GCC6               Pass: 100%/2   | Total:  9m 10s | Avg:  4m 35s | Max:  4m 45s
  🟩 GCC7               Pass: 100%/6   | Total: 30m 06s | Avg:  5m 01s | Max:  5m 47s
  🟩 GCC8               Pass: 100%/6   | Total: 53m 59s | Avg:  8m 59s | Max: 27m 42s
  🟩 GCC9               Pass: 100%/6   | Total: 31m 19s | Avg:  5m 13s | Max:  5m 55s
  🟩 GCC10              Pass: 100%/4   | Total: 22m 14s | Avg:  5m 33s | Max:  6m 17s
  🟩 GCC11              Pass: 100%/7   | Total: 41m 53s | Avg:  5m 59s | Max:  6m 50s
  🟩 GCC12              Pass: 100%/4   | Total: 24m 20s | Avg:  6m 05s | Max:  6m 18s
  🟩 GCC13              Pass: 100%/16  | Total:  1h 59m | Avg:  7m 27s | Max: 20m 24s
  🟩 Intel2023.2.0      Pass: 100%/3   | Total: 21m 14s | Avg:  7m 04s | Max:  7m 28s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 21m 07s | Avg: 21m 07s | Max: 21m 07s | Hits:  99%/1852  
  🟩 MSVC14.29          Pass: 100%/2   | Total: 40m 56s | Avg: 20m 28s | Max: 22m 15s | Hits:  99%/3704  
  🟩 MSVC14.39          Pass: 100%/2   | Total: 44m 41s | Avg: 22m 20s | Max: 24m 53s | Hits:  99%/3704  
  🟩 NVHPC24.7          Pass: 100%/4   | Total: 55m 52s | Avg: 13m 58s | Max: 14m 26s
🟩 cxx_family
  🟩 Clang              Pass: 100%/48  | Total:  4h 39m | Avg:  5m 49s | Max: 12m 26s
  🟩 GCC                Pass: 100%/51  | Total:  5h 32m | Avg:  6m 30s | Max: 27m 42s
  🟩 Intel              Pass: 100%/3   | Total: 21m 14s | Avg:  7m 04s | Max:  7m 28s
  🟩 MSVC               Pass: 100%/5   | Total:  1h 46m | Avg: 21m 20s | Max: 24m 53s | Hits:  99%/9260  
  🟩 NVHPC              Pass: 100%/4   | Total: 55m 52s | Avg: 13m 58s | Max: 14m 26s
🟩 gpu
  🟩 v100               Pass: 100%/111 | Total: 13h 15m | Avg:  7m 09s | Max: 27m 42s | Hits:  99%/9260  
🟩 jobs
  🟩 Build              Pass: 100%/103 | Total: 11h 28m | Avg:  6m 41s | Max: 27m 42s | Hits:  99%/7408  
  🟩 TestCPU            Pass: 100%/4   | Total: 48m 32s | Avg: 12m 08s | Max: 24m 53s | Hits:  99%/1852  
  🟩 TestGPU            Pass: 100%/4   | Total: 58m 00s | Avg: 14m 30s | Max: 20m 24s
🟩 sm
  🟩 60;70;80;90        Pass: 100%/3   | Total: 17m 13s | Avg:  5m 44s | Max:  5m 59s
  🟩 90a                Pass: 100%/4   | Total: 18m 26s | Avg:  4m 36s | Max:  4m 58s
🟩 std
  🟩 11                 Pass: 100%/30  | Total:  3h 13m | Avg:  6m 26s | Max: 27m 42s
  🟩 14                 Pass: 100%/29  | Total:  3h 21m | Avg:  6m 56s | Max: 22m 15s | Hits:  99%/3704  
  🟩 17                 Pass: 100%/27  | Total:  2h 58m | Avg:  6m 35s | Max: 18m 41s | Hits:  99%/1852  
  🟩 20                 Pass: 100%/23  | Total:  3h 16m | Avg:  8m 31s | Max: 24m 53s | Hits:  99%/3704

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 14s | Avg: 5m 07s | Max: 7m 44s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total: 10m 14s | Avg:  5m 07s | Max:  7m 44s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 30s | Avg:  2m 30s | Max:  2m 30s
  🟩 Test               Pass: 100%/1   | Total:  7m 44s | Avg:  7m 44s | Max:  7m 44s

🟩 python: Pass: 100%/1 | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s

🏃‍ Runner counts (total jobs: 224)

#	Runner
185	`linux-amd64-cpu16`
16	`linux-arm64-cpu16`
14	`linux-amd64-gpu-v100-latest-1`
9	`windows-amd64-cpu16`

github-actions · 2024-12-04T15:43:36Z

🟨 CI finished in 2h 03m: Pass: 99%/224 | Total: 3d 02h | Avg: 19m 56s | Max: 55m 23s | Hits: 99%/12308

🟨 cub: Pass: 99%/110 | Total: 2d 12h | Avg: 33m 01s | Max: 55m 23s | Hits: 99%/3048

🔍 cpu: amd64 🔍
  🔍 amd64              Pass:  99%/102 | Total:  2d 06h | Avg: 32m 19s | Max: 55m 23s | Hits:  99%/3048  
  🟩 arm64              Pass: 100%/8   | Total:  5h 36m | Avg: 42m 03s | Max: 47m 37s
🔍 ctk: 12.6 🔍
  🟩 11.1               Pass: 100%/15  | Total:  7h 38m | Avg: 30m 35s | Max: 33m 17s | Hits:  99%/762   
  🟩 11.8               Pass: 100%/3   | Total:  2h 17m | Avg: 45m 52s | Max: 46m 26s
  🟩 12.5               Pass: 100%/4   | Total:  2h 08m | Avg: 32m 12s | Max: 39m 55s
  🔍 12.6               Pass:  98%/88  | Total:  2d 00h | Avg: 33m 02s | Max: 55m 23s | Hits:  99%/2286  
🔍 cudacxx: nvcc12.6 🔍
  🟩 ClangCUDA18        Pass: 100%/4   | Total:  3h 31m | Avg: 52m 47s | Max: 55m 23s
  🟩 nvcc11.1           Pass: 100%/15  | Total:  7h 38m | Avg: 30m 35s | Max: 33m 17s | Hits:  99%/762   
  🟩 nvcc11.8           Pass: 100%/3   | Total:  2h 17m | Avg: 45m 52s | Max: 46m 26s
  🟩 nvcc12.5           Pass: 100%/4   | Total:  2h 08m | Avg: 32m 12s | Max: 39m 55s
  🔍 nvcc12.6           Pass:  98%/84  | Total:  1d 20h | Avg: 32m 06s | Max: 53m 29s | Hits:  99%/2286  
🔍 cudacxx_family: nvcc 🔍
  🟩 ClangCUDA          Pass: 100%/4   | Total:  3h 31m | Avg: 52m 47s | Max: 55m 23s
  🔍 nvcc               Pass:  99%/106 | Total:  2d 09h | Avg: 32m 17s | Max: 53m 29s | Hits:  99%/3048  
🔍 cxx: GCC9 🔍
  🟩 Clang9             Pass: 100%/6   | Total:  3h 26m | Avg: 34m 27s | Max: 38m 16s
  🟩 Clang10            Pass: 100%/3   | Total:  1h 45m | Avg: 35m 16s | Max: 36m 18s
  🟩 Clang11            Pass: 100%/4   | Total:  2h 21m | Avg: 35m 25s | Max: 37m 45s
  🟩 Clang12            Pass: 100%/4   | Total:  1h 46m | Avg: 26m 37s | Max: 33m 50s
  🟩 Clang13            Pass: 100%/4   | Total:  2h 23m | Avg: 35m 57s | Max: 38m 08s
  🟩 Clang14            Pass: 100%/4   | Total:  2h 24m | Avg: 36m 00s | Max: 37m 09s
  🟩 Clang15            Pass: 100%/4   | Total:  2h 19m | Avg: 34m 49s | Max: 36m 06s
  🟩 Clang16            Pass: 100%/4   | Total:  2h 24m | Avg: 36m 03s | Max: 37m 54s
  🟩 Clang17            Pass: 100%/4   | Total:  2h 17m | Avg: 34m 21s | Max: 35m 48s
  🟩 Clang18            Pass: 100%/11  | Total:  7h 33m | Avg: 41m 12s | Max: 55m 23s
  🟩 GCC6               Pass: 100%/2   | Total:  1h 03m | Avg: 31m 45s | Max: 33m 17s
  🟩 GCC7               Pass: 100%/6   | Total:  3h 11m | Avg: 31m 56s | Max: 34m 37s
  🟩 GCC8               Pass: 100%/6   | Total:  3h 18m | Avg: 33m 01s | Max: 36m 46s
  🔍 GCC9               Pass:  83%/6   | Total:  2h 49m | Avg: 28m 14s | Max: 34m 59s
  🟩 GCC10              Pass: 100%/4   | Total:  2h 42m | Avg: 40m 30s | Max: 53m 29s
  🟩 GCC11              Pass: 100%/7   | Total:  4h 37m | Avg: 39m 35s | Max: 46m 26s
  🟩 GCC12              Pass: 100%/4   | Total:  2h 19m | Avg: 34m 46s | Max: 36m 58s
  🟩 GCC13              Pass: 100%/16  | Total:  6h 52m | Avg: 25m 48s | Max: 47m 37s
  🟩 Intel2023.2.0      Pass: 100%/3   | Total:  1h 49m | Avg: 36m 27s | Max: 36m 39s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 19m 19s | Avg: 19m 19s | Max: 19m 19s | Hits:  99%/762   
  🟩 MSVC14.29          Pass: 100%/2   | Total: 25m 50s | Avg: 12m 55s | Max: 13m 38s | Hits:  99%/1524  
  🟩 MSVC14.39          Pass: 100%/1   | Total: 13m 12s | Avg: 13m 12s | Max: 13m 12s | Hits:  99%/762   
  🟩 NVHPC24.7          Pass: 100%/4   | Total:  2h 08m | Avg: 32m 12s | Max: 39m 55s
🔍 cxx_family: GCC 🔍
  🟩 Clang              Pass: 100%/48  | Total:  1d 04h | Avg: 35m 53s | Max: 55m 23s
  🔍 GCC                Pass:  98%/51  | Total:  1d 02h | Avg: 31m 38s | Max: 53m 29s
  🟩 Intel              Pass: 100%/3   | Total:  1h 49m | Avg: 36m 27s | Max: 36m 39s
  🟩 MSVC               Pass: 100%/4   | Total: 58m 21s | Avg: 14m 35s | Max: 19m 19s | Hits:  99%/3048  
  🟩 NVHPC              Pass: 100%/4   | Total:  2h 08m | Avg: 32m 12s | Max: 39m 55s
🔍 jobs: Build 🔍
  🔍 Build              Pass:  99%/102 | Total:  2d 08h | Avg: 33m 15s | Max: 55m 23s | Hits:  99%/3048  
  🟩 DeviceLaunch       Pass: 100%/1   | Total: 24m 26s | Avg: 24m 26s | Max: 24m 26s
  🟩 GraphCapture       Pass: 100%/1   | Total: 19m 10s | Avg: 19m 10s | Max: 19m 10s
  🟩 HostLaunch         Pass: 100%/3   | Total:  1h 38m | Avg: 32m 47s | Max: 40m 02s
  🟩 TestGPU            Pass: 100%/3   | Total:  1h 39m | Avg: 33m 08s | Max: 46m 40s
🔍 std: 14 🔍
  🟩 11                 Pass: 100%/30  | Total: 16h 45m | Avg: 33m 31s | Max: 55m 23s
  🔍 14                 Pass:  96%/29  | Total: 16h 27m | Avg: 34m 02s | Max: 50m 52s | Hits:  99%/1524  
  🟩 17                 Pass: 100%/27  | Total: 15h 14m | Avg: 33m 51s | Max: 54m 47s | Hits:  99%/762   
  🟩 20                 Pass: 100%/24  | Total: 12h 06m | Avg: 30m 15s | Max: 50m 08s | Hits:  99%/762   
🟨 gpu
  🟨 v100               Pass:  99%/110 | Total:  2d 12h | Avg: 33m 01s | Max: 55m 23s | Hits:  99%/3048  
🟩 sm
  🟩 60;70;80;90        Pass: 100%/3   | Total:  2h 17m | Avg: 45m 52s | Max: 46m 26s
  🟩 90a                Pass: 100%/4   | Total: 56m 57s | Avg: 14m 14s | Max: 14m 48s

🟩 thrust: Pass: 100%/111 | Total: 13h 26m | Avg: 7m 16s | Max: 29m 32s | Hits: 99%/9260

🟩 cmake_options
  🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 17m 31s | Avg:  8m 45s | Max: 11m 13s
🟩 cpu
  🟩 amd64              Pass: 100%/103 | Total: 12h 46m | Avg:  7m 26s | Max: 29m 32s | Hits:  99%/9260  
  🟩 arm64              Pass: 100%/8   | Total: 40m 17s | Avg:  5m 02s | Max:  5m 48s
🟩 ctk
  🟩 11.1               Pass: 100%/15  | Total:  1h 48m | Avg:  7m 12s | Max: 29m 32s | Hits:  99%/1852  
  🟩 11.8               Pass: 100%/3   | Total: 17m 20s | Avg:  5m 46s | Max:  6m 19s
  🟩 12.5               Pass: 100%/4   | Total: 57m 30s | Avg: 14m 22s | Max: 15m 14s
  🟩 12.6               Pass: 100%/89  | Total: 10h 23m | Avg:  7m 00s | Max: 28m 51s | Hits:  99%/7408  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/4   | Total: 20m 54s | Avg:  5m 13s | Max:  5m 29s
  🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 48m | Avg:  7m 12s | Max: 29m 32s | Hits:  99%/1852  
  🟩 nvcc11.8           Pass: 100%/3   | Total: 17m 20s | Avg:  5m 46s | Max:  6m 19s
  🟩 nvcc12.5           Pass: 100%/4   | Total: 57m 30s | Avg: 14m 22s | Max: 15m 14s
  🟩 nvcc12.6           Pass: 100%/85  | Total: 10h 03m | Avg:  7m 05s | Max: 28m 51s | Hits:  99%/7408  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/4   | Total: 20m 54s | Avg:  5m 13s | Max:  5m 29s
  🟩 nvcc               Pass: 100%/107 | Total: 13h 06m | Avg:  7m 20s | Max: 29m 32s | Hits:  99%/9260  
🟩 cxx
  🟩 Clang9             Pass: 100%/6   | Total: 34m 16s | Avg:  5m 42s | Max:  7m 07s
  🟩 Clang10            Pass: 100%/3   | Total: 19m 25s | Avg:  6m 28s | Max:  7m 08s
  🟩 Clang11            Pass: 100%/4   | Total: 21m 51s | Avg:  5m 27s | Max:  6m 02s
  🟩 Clang12            Pass: 100%/4   | Total: 20m 58s | Avg:  5m 14s | Max:  5m 46s
  🟩 Clang13            Pass: 100%/4   | Total: 21m 49s | Avg:  5m 27s | Max:  5m 44s
  🟩 Clang14            Pass: 100%/4   | Total: 22m 34s | Avg:  5m 38s | Max:  5m 54s
  🟩 Clang15            Pass: 100%/4   | Total: 22m 37s | Avg:  5m 39s | Max:  6m 12s
  🟩 Clang16            Pass: 100%/4   | Total: 22m 17s | Avg:  5m 34s | Max:  5m 47s
  🟩 Clang17            Pass: 100%/4   | Total: 23m 16s | Avg:  5m 49s | Max:  6m 07s
  🟩 Clang18            Pass: 100%/11  | Total:  1h 13m | Avg:  6m 42s | Max: 19m 50s
  🟩 GCC6               Pass: 100%/2   | Total:  8m 45s | Avg:  4m 22s | Max:  4m 37s
  🟩 GCC7               Pass: 100%/6   | Total: 29m 12s | Avg:  4m 52s | Max:  5m 52s
  🟩 GCC8               Pass: 100%/6   | Total: 55m 26s | Avg:  9m 14s | Max: 29m 32s
  🟩 GCC9               Pass: 100%/6   | Total: 30m 51s | Avg:  5m 08s | Max:  6m 09s
  🟩 GCC10              Pass: 100%/4   | Total: 46m 01s | Avg: 11m 30s | Max: 28m 51s
  🟩 GCC11              Pass: 100%/7   | Total: 40m 05s | Avg:  5m 43s | Max:  6m 19s
  🟩 GCC12              Pass: 100%/4   | Total: 23m 36s | Avg:  5m 54s | Max:  6m 14s
  🟩 GCC13              Pass: 100%/16  | Total:  2h 00m | Avg:  7m 31s | Max: 20m 05s
  🟩 Intel2023.2.0      Pass: 100%/3   | Total: 21m 47s | Avg:  7m 15s | Max:  7m 31s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 18m 25s | Avg: 18m 25s | Max: 18m 25s | Hits:  99%/1852  
  🟩 MSVC14.29          Pass: 100%/2   | Total: 31m 52s | Avg: 15m 56s | Max: 16m 21s | Hits:  99%/3704  
  🟩 MSVC14.39          Pass: 100%/2   | Total: 40m 17s | Avg: 20m 08s | Max: 23m 26s | Hits:  99%/3704  
  🟩 NVHPC24.7          Pass: 100%/4   | Total: 57m 30s | Avg: 14m 22s | Max: 15m 14s
🟩 cxx_family
  🟩 Clang              Pass: 100%/48  | Total:  4h 42m | Avg:  5m 53s | Max: 19m 50s
  🟩 GCC                Pass: 100%/51  | Total:  5h 54m | Avg:  6m 56s | Max: 29m 32s
  🟩 Intel              Pass: 100%/3   | Total: 21m 47s | Avg:  7m 15s | Max:  7m 31s
  🟩 MSVC               Pass: 100%/5   | Total:  1h 30m | Avg: 18m 06s | Max: 23m 26s | Hits:  99%/9260  
  🟩 NVHPC              Pass: 100%/4   | Total: 57m 30s | Avg: 14m 22s | Max: 15m 14s
🟩 gpu
  🟩 v100               Pass: 100%/111 | Total: 13h 26m | Avg:  7m 16s | Max: 29m 32s | Hits:  99%/9260  
🟩 jobs
  🟩 Build              Pass: 100%/103 | Total: 11h 32m | Avg:  6m 43s | Max: 29m 32s | Hits:  99%/7408  
  🟩 TestCPU            Pass: 100%/4   | Total: 48m 01s | Avg: 12m 00s | Max: 23m 26s | Hits:  99%/1852  
  🟩 TestGPU            Pass: 100%/4   | Total:  1h 06m | Avg: 16m 31s | Max: 20m 05s
🟩 sm
  🟩 60;70;80;90        Pass: 100%/3   | Total: 17m 20s | Avg:  5m 46s | Max:  6m 19s
  🟩 90a                Pass: 100%/4   | Total: 18m 34s | Avg:  4m 38s | Max:  5m 02s
🟩 std
  🟩 11                 Pass: 100%/30  | Total:  3h 47m | Avg:  7m 34s | Max: 29m 32s
  🟩 14                 Pass: 100%/29  | Total:  3h 11m | Avg:  6m 35s | Max: 18m 25s | Hits:  99%/3704  
  🟩 17                 Pass: 100%/27  | Total:  2h 54m | Avg:  6m 26s | Max: 15m 31s | Hits:  99%/1852  
  🟩 20                 Pass: 100%/23  | Total:  3h 16m | Avg:  8m 33s | Max: 23m 26s | Hits:  99%/3704

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 36s | Avg: 5m 18s | Max: 8m 11s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total: 10m 36s | Avg:  5m 18s | Max:  8m 11s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total: 10m 36s | Avg:  5m 18s | Max:  8m 11s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 36s | Avg:  5m 18s | Max:  8m 11s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total: 10m 36s | Avg:  5m 18s | Max:  8m 11s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total: 10m 36s | Avg:  5m 18s | Max:  8m 11s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total: 10m 36s | Avg:  5m 18s | Max:  8m 11s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total: 10m 36s | Avg:  5m 18s | Max:  8m 11s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 25s | Avg:  2m 25s | Max:  2m 25s
  🟩 Test               Pass: 100%/1   | Total:  8m 11s | Avg:  8m 11s | Max:  8m 11s

🟩 python: Pass: 100%/1 | Total: 15m 25s | Avg: 15m 25s | Max: 15m 25s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 15m 25s | Avg: 15m 25s | Max: 15m 25s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 15m 25s | Avg: 15m 25s | Max: 15m 25s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 15m 25s | Avg: 15m 25s | Max: 15m 25s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 15m 25s | Avg: 15m 25s | Max: 15m 25s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 15m 25s | Avg: 15m 25s | Max: 15m 25s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 15m 25s | Avg: 15m 25s | Max: 15m 25s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 15m 25s | Avg: 15m 25s | Max: 15m 25s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 15m 25s | Avg: 15m 25s | Max: 15m 25s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
	Thrust
	CUDA Experimental
	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
+/-	python
+/-	CCCL C Parallel Library
+/-	Catch2Helper

🏃‍ Runner counts (total jobs: 224)

#	Runner
185	`linux-amd64-cpu16`
16	`linux-arm64-cpu16`
14	`linux-amd64-gpu-v100-latest-1`
9	`windows-amd64-cpu16`

bernhardmgruber

Last few nits.

cub/cub/device/device_reduce.cuh

cub/cub/device/dispatch/dispatch_streaming_reduce.cuh

bernhardmgruber · 2024-12-05T10:42:19Z

cub/cub/device/dispatch/dispatch_streaming_reduce.cuh

+          cub::KeyValuePair<::cuda::std::int32_t, ResultT>>>>;
+
+    using index_t = typename kv_pair_t::Key;
+    *out_it       = kv_pair_t{static_cast<index_t>(reduced_result.key), reduced_result.value};


Important: this comment is marked as resolved, but I don't see an assert that checks that static_cast<index_t>(reduced_result.key) does not change the value. You could add:

Suggested change

-    *out_it = kv_pair_t{static_cast<index_t>(reduced_result.key), reduced_result.value};
+    _CCCL_ASSERT(static_cast<OffsetT>(static_cast<index_t>(reduced_result.key)) == reduced_result.key);
+    *out_it = kv_pair_t{static_cast<index_t>(reduced_result.key), reduced_result.value};
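
For reference, the round-trip check behind the suggested assert can be sketched in plain host C++. This is only an illustration: `fits_without_loss` and `checked_narrow` are hypothetical helper names, not CUB or libcu++ APIs, and plain `assert` stands in for `_CCCL_ASSERT`. The sketch also assumes the wide and narrow types have the same signedness, since a round trip alone cannot detect a sign flip between, say, `std::uint64_t` and `std::int32_t`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper (not part of CUB): returns true if `value` survives a
// round trip through NarrowT, i.e. the narrowing cast does not change it.
template <typename NarrowT, typename WideT>
constexpr bool fits_without_loss(WideT value)
{
  // Assumption of this sketch: both types are signed or both are unsigned.
  // With mixed signedness a round trip can succeed even though the sign changed.
  static_assert(std::is_signed<NarrowT>::value == std::is_signed<WideT>::value,
                "sketch only covers types of matching signedness");
  return static_cast<WideT>(static_cast<NarrowT>(value)) == value;
}

// Hypothetical helper: narrows `value`, asserting in debug builds that the
// cast is lossless (plain assert used here in place of a library macro).
template <typename NarrowT, typename WideT>
NarrowT checked_narrow(WideT value)
{
  assert(fits_without_loss<NarrowT>(value) && "narrowing cast changed the value");
  return static_cast<NarrowT>(value);
}
```

With a 64-bit signed offset and a 32-bit signed index, `fits_without_loss<std::int32_t>(offset)` holds for every offset up to `INT_MAX` and fails for `std::int64_t{1} << 31`, which is the case the assert is meant to catch.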

cub/test/catch2_test_device_reduce_large_offsets.cu

github-actions · 2024-12-19T15:29:51Z

🟩 CI finished in 1h 25m: Pass: 100%/96 | Total: 2d 13h | Avg: 38m 25s | Max: 1h 04m | Hits: 74%/12404

🟩 cub: Pass: 100%/47 | Total: 1d 13h | Avg: 47m 55s | Max: 1h 04m | Hits: 63%/3144

🟩 cpu
  🟩 amd64              Pass: 100%/45  | Total:  1d 11h | Avg: 47m 31s | Max:  1h 04m | Hits:  63%/3144  
  🟩 arm64              Pass: 100%/2   | Total:  1h 53m | Avg: 56m 45s | Max: 57m 07s
🟩 ctk
  🟩 11.1               Pass: 100%/7   | Total:  5h 42m | Avg: 48m 58s | Max: 59m 26s | Hits:  63%/786   
  🟩 12.5               Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 04m
  🟩 12.6               Pass: 100%/38  | Total:  1d 05h | Avg: 46m 58s | Max:  1h 02m | Hits:  63%/2358  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 53m | Avg: 56m 37s | Max: 57m 54s
  🟩 nvcc11.1           Pass: 100%/7   | Total:  5h 42m | Avg: 48m 58s | Max: 59m 26s | Hits:  63%/786   
  🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 04m
  🟩 nvcc12.6           Pass: 100%/36  | Total:  1d 03h | Avg: 46m 26s | Max:  1h 02m | Hits:  63%/2358  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 53m | Avg: 56m 37s | Max: 57m 54s
  🟩 nvcc               Pass: 100%/45  | Total:  1d 11h | Avg: 47m 32s | Max:  1h 04m | Hits:  63%/3144  
🟩 cxx
  🟩 Clang9             Pass: 100%/4   | Total:  3h 28m | Avg: 52m 08s | Max: 56m 41s
  🟩 Clang10            Pass: 100%/1   | Total: 53m 10s | Avg: 53m 10s | Max: 53m 10s
  🟩 Clang11            Pass: 100%/1   | Total: 50m 36s | Avg: 50m 36s | Max: 50m 36s
  🟩 Clang12            Pass: 100%/1   | Total: 51m 05s | Avg: 51m 05s | Max: 51m 05s
  🟩 Clang13            Pass: 100%/1   | Total: 50m 41s | Avg: 50m 41s | Max: 50m 41s
  🟩 Clang14            Pass: 100%/1   | Total: 50m 27s | Avg: 50m 27s | Max: 50m 27s
  🟩 Clang15            Pass: 100%/1   | Total: 50m 51s | Avg: 50m 51s | Max: 50m 51s
  🟩 Clang16            Pass: 100%/1   | Total: 58m 41s | Avg: 58m 41s | Max: 58m 41s
  🟩 Clang17            Pass: 100%/1   | Total: 54m 56s | Avg: 54m 56s | Max: 54m 56s
  🟩 Clang18            Pass: 100%/7   | Total:  5h 17m | Avg: 45m 20s | Max: 57m 54s
  🟩 GCC6               Pass: 100%/2   | Total:  1h 32m | Avg: 46m 10s | Max: 49m 25s
  🟩 GCC7               Pass: 100%/2   | Total:  1h 47m | Avg: 53m 33s | Max: 57m 44s
  🟩 GCC8               Pass: 100%/1   | Total: 50m 32s | Avg: 50m 32s | Max: 50m 32s
  🟩 GCC9               Pass: 100%/3   | Total:  2h 23m | Avg: 47m 58s | Max: 51m 48s
  🟩 GCC10              Pass: 100%/1   | Total: 55m 50s | Avg: 55m 50s | Max: 55m 50s
  🟩 GCC11              Pass: 100%/1   | Total: 51m 24s | Avg: 51m 24s | Max: 51m 24s
  🟩 GCC12              Pass: 100%/3   | Total:  1h 40m | Avg: 33m 27s | Max: 57m 19s
  🟩 GCC13              Pass: 100%/8   | Total:  4h 36m | Avg: 34m 33s | Max: 57m 07s
  🟩 Intel2023.2.0      Pass: 100%/1   | Total: 59m 32s | Avg: 59m 32s | Max: 59m 32s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 59m 26s | Avg: 59m 26s | Max: 59m 26s | Hits:  63%/786   
  🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 00m | Avg:  1h 00m | Max:  1h 00m | Hits:  63%/786   
  🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 02m | Hits:  63%/1572  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 04m
🟩 cxx_family
  🟩 Clang              Pass: 100%/19  | Total: 15h 46m | Avg: 49m 48s | Max: 58m 41s
  🟩 GCC                Pass: 100%/21  | Total: 14h 37m | Avg: 41m 48s | Max: 57m 44s
  🟩 Intel              Pass: 100%/1   | Total: 59m 32s | Avg: 59m 32s | Max: 59m 32s
  🟩 MSVC               Pass: 100%/4   | Total:  4h 04m | Avg:  1h 01m | Max:  1h 02m | Hits:  63%/3144  
  🟩 NVHPC              Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 04m
🟩 gpu
  🟩 h100               Pass: 100%/2   | Total: 43m 04s | Avg: 21m 32s | Max: 24m 45s
  🟩 v100               Pass: 100%/45  | Total:  1d 12h | Avg: 49m 05s | Max:  1h 04m | Hits:  63%/3144  
🟩 jobs
  🟩 Build              Pass: 100%/40  | Total:  1d 11h | Avg: 52m 43s | Max:  1h 04m | Hits:  63%/3144  
  🟩 DeviceLaunch       Pass: 100%/1   | Total: 16m 50s | Avg: 16m 50s | Max: 16m 50s
  🟩 GraphCapture       Pass: 100%/1   | Total: 19m 39s | Avg: 19m 39s | Max: 19m 39s
  🟩 HostLaunch         Pass: 100%/3   | Total: 58m 04s | Avg: 19m 21s | Max: 21m 26s
  🟩 TestGPU            Pass: 100%/2   | Total: 48m 48s | Avg: 24m 24s | Max: 26m 42s
🟩 sm
  🟩 90                 Pass: 100%/2   | Total: 43m 04s | Avg: 21m 32s | Max: 24m 45s
  🟩 90a                Pass: 100%/1   | Total: 23m 00s | Avg: 23m 00s | Max: 23m 00s
🟩 std
  🟩 11                 Pass: 100%/5   | Total:  3h 55m | Avg: 47m 10s | Max: 52m 59s
  🟩 14                 Pass: 100%/4   | Total:  3h 43m | Avg: 55m 49s | Max: 59m 26s | Hits:  63%/786   
  🟩 17                 Pass: 100%/12  | Total: 11h 09m | Avg: 55m 48s | Max:  1h 04m | Hits:  63%/1572  
  🟩 20                 Pass: 100%/26  | Total: 18h 43m | Avg: 43m 13s | Max:  1h 02m | Hits:  63%/786

🟩 thrust: Pass: 100%/46 | Total: 23h 21m | Avg: 30m 27s | Max: 56m 44s | Hits: 77%/9260

🟩 cmake_options
  🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 39m 19s | Avg: 19m 39s | Max: 27m 32s
🟩 cpu
  🟩 amd64              Pass: 100%/44  | Total: 22h 18m | Avg: 30m 24s | Max: 56m 44s | Hits:  77%/9260  
  🟩 arm64              Pass: 100%/2   | Total:  1h 03m | Avg: 31m 30s | Max: 35m 00s
🟩 ctk
  🟩 11.1               Pass: 100%/7   | Total:  3h 27m | Avg: 29m 36s | Max: 54m 23s | Hits:  71%/1852  
  🟩 12.5               Pass: 100%/2   | Total:  1h 42m | Avg: 51m 04s | Max: 54m 30s
  🟩 12.6               Pass: 100%/37  | Total: 18h 11m | Avg: 29m 30s | Max: 56m 44s | Hits:  78%/7408  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total: 54m 06s | Avg: 27m 03s | Max: 28m 33s
  🟩 nvcc11.1           Pass: 100%/7   | Total:  3h 27m | Avg: 29m 36s | Max: 54m 23s | Hits:  71%/1852  
  🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 42m | Avg: 51m 04s | Max: 54m 30s
  🟩 nvcc12.6           Pass: 100%/35  | Total: 17h 17m | Avg: 29m 38s | Max: 56m 44s | Hits:  78%/7408  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total: 54m 06s | Avg: 27m 03s | Max: 28m 33s
  🟩 nvcc               Pass: 100%/44  | Total: 22h 27m | Avg: 30m 36s | Max: 56m 44s | Hits:  77%/9260  
🟩 cxx
  🟩 Clang9             Pass: 100%/4   | Total:  1h 45m | Avg: 26m 18s | Max: 31m 16s
  🟩 Clang10            Pass: 100%/1   | Total: 35m 10s | Avg: 35m 10s | Max: 35m 10s
  🟩 Clang11            Pass: 100%/1   | Total: 28m 58s | Avg: 28m 58s | Max: 28m 58s
  🟩 Clang12            Pass: 100%/1   | Total: 29m 09s | Avg: 29m 09s | Max: 29m 09s
  🟩 Clang13            Pass: 100%/1   | Total: 29m 05s | Avg: 29m 05s | Max: 29m 05s
  🟩 Clang14            Pass: 100%/1   | Total: 32m 30s | Avg: 32m 30s | Max: 32m 30s
  🟩 Clang15            Pass: 100%/1   | Total: 29m 44s | Avg: 29m 44s | Max: 29m 44s
  🟩 Clang16            Pass: 100%/1   | Total: 31m 47s | Avg: 31m 47s | Max: 31m 47s
  🟩 Clang17            Pass: 100%/1   | Total: 33m 12s | Avg: 33m 12s | Max: 33m 12s
  🟩 Clang18            Pass: 100%/7   | Total:  2h 41m | Avg: 23m 00s | Max: 30m 49s
  🟩 GCC6               Pass: 100%/2   | Total: 51m 49s | Avg: 25m 54s | Max: 29m 06s
  🟩 GCC7               Pass: 100%/2   | Total: 53m 29s | Avg: 26m 44s | Max: 30m 25s
  🟩 GCC8               Pass: 100%/1   | Total: 30m 32s | Avg: 30m 32s | Max: 30m 32s
  🟩 GCC9               Pass: 100%/3   | Total:  1h 24m | Avg: 28m 05s | Max: 32m 40s
  🟩 GCC10              Pass: 100%/1   | Total: 31m 04s | Avg: 31m 04s | Max: 31m 04s
  🟩 GCC11              Pass: 100%/1   | Total: 31m 01s | Avg: 31m 01s | Max: 31m 01s
  🟩 GCC12              Pass: 100%/1   | Total: 32m 05s | Avg: 32m 05s | Max: 32m 05s
  🟩 GCC13              Pass: 100%/8   | Total:  3h 00m | Avg: 22m 30s | Max: 35m 00s
  🟩 Intel2023.2.0      Pass: 100%/1   | Total: 43m 14s | Avg: 43m 14s | Max: 43m 14s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 54m 23s | Avg: 54m 23s | Max: 54m 23s | Hits:  71%/1852  
  🟩 MSVC14.29          Pass: 100%/1   | Total: 54m 00s | Avg: 54m 00s | Max: 54m 00s | Hits:  71%/1852  
  🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 17m | Avg: 45m 45s | Max: 56m 44s | Hits:  81%/5556  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 42m | Avg: 51m 04s | Max: 54m 30s
🟩 cxx_family
  🟩 Clang              Pass: 100%/19  | Total:  8h 35m | Avg: 27m 09s | Max: 35m 10s
  🟩 GCC                Pass: 100%/19  | Total:  8h 14m | Avg: 26m 00s | Max: 35m 00s
  🟩 Intel              Pass: 100%/1   | Total: 43m 14s | Avg: 43m 14s | Max: 43m 14s
  🟩 MSVC               Pass: 100%/5   | Total:  4h 05m | Avg: 49m 07s | Max: 56m 44s | Hits:  77%/9260  
  🟩 NVHPC              Pass: 100%/2   | Total:  1h 42m | Avg: 51m 04s | Max: 54m 30s
🟩 gpu
  🟩 v100               Pass: 100%/46  | Total: 23h 21m | Avg: 30m 27s | Max: 56m 44s | Hits:  77%/9260  
🟩 jobs
  🟩 Build              Pass: 100%/40  | Total: 22h 06m | Avg: 33m 10s | Max: 56m 44s | Hits:  71%/7408  
  🟩 TestCPU            Pass: 100%/3   | Total: 40m 13s | Avg: 13m 24s | Max: 24m 06s | Hits:  99%/1852  
  🟩 TestGPU            Pass: 100%/3   | Total: 34m 14s | Avg: 11m 24s | Max: 12m 00s
🟩 sm
  🟩 90a                Pass: 100%/1   | Total: 18m 40s | Avg: 18m 40s | Max: 18m 40s
🟩 std
  🟩 11                 Pass: 100%/5   | Total:  1h 54m | Avg: 22m 53s | Max: 24m 30s
  🟩 14                 Pass: 100%/4   | Total:  2h 25m | Avg: 36m 17s | Max: 54m 23s | Hits:  71%/1852  
  🟩 17                 Pass: 100%/12  | Total:  7h 33m | Avg: 37m 45s | Max: 56m 25s | Hits:  71%/3704  
  🟩 20                 Pass: 100%/23  | Total: 10h 49m | Avg: 28m 13s | Max: 56m 44s | Hits:  85%/3704

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 07s | Avg: 4m 33s | Max: 7m 01s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total:  9m 07s | Avg:  4m 33s | Max:  7m 01s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total:  9m 07s | Avg:  4m 33s | Max:  7m 01s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 07s | Avg:  4m 33s | Max:  7m 01s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total:  9m 07s | Avg:  4m 33s | Max:  7m 01s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total:  9m 07s | Avg:  4m 33s | Max:  7m 01s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total:  9m 07s | Avg:  4m 33s | Max:  7m 01s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total:  9m 07s | Avg:  4m 33s | Max:  7m 01s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 06s | Avg:  2m 06s | Max:  2m 06s
  🟩 Test               Pass: 100%/1   | Total:  7m 01s | Avg:  7m 01s | Max:  7m 01s

🟩 python: Pass: 100%/1 | Total: 25m 29s | Avg: 25m 29s | Max: 25m 29s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 25m 29s | Avg: 25m 29s | Max: 25m 29s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 25m 29s | Avg: 25m 29s | Max: 25m 29s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 25m 29s | Avg: 25m 29s | Max: 25m 29s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 25m 29s | Avg: 25m 29s | Max: 25m 29s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 25m 29s | Avg: 25m 29s | Max: 25m 29s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 25m 29s | Avg: 25m 29s | Max: 25m 29s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 25m 29s | Avg: 25m 29s | Max: 25m 29s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 25m 29s | Avg: 25m 29s | Max: 25m 29s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
	Thrust
	CUDA Experimental
	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
+/-	python
+/-	CCCL C Parallel Library
+/-	Catch2Helper

🏃‍ Runner counts (total jobs: 96)

#	Runner
71	`linux-amd64-cpu16`
11	`linux-amd64-gpu-v100-latest-1`
9	`windows-amd64-cpu16`
4	`linux-arm64-cpu16`
1	`linux-amd64-gpu-h100-latest-1-testing`

github-actions · 2024-12-19T17:14:33Z

🟩 CI finished in 57m 02s: Pass: 100%/96 | Total: 13h 20m | Avg: 8m 20s | Max: 26m 17s | Hits: 99%/12404

🟩 cub: Pass: 100%/47 | Total: 6h 48m | Avg: 8m 41s | Max: 26m 17s | Hits: 98%/3144

🟩 cpu
  🟩 amd64              Pass: 100%/45  | Total:  6h 38m | Avg:  8m 51s | Max: 26m 17s | Hits:  98%/3144  
  🟩 arm64              Pass: 100%/2   | Total:  9m 58s | Avg:  4m 59s | Max:  5m 13s
🟩 ctk
  🟩 11.1               Pass: 100%/7   | Total: 42m 39s | Avg:  6m 05s | Max: 16m 00s | Hits:  98%/786   
  🟩 12.5               Pass: 100%/2   | Total: 18m 13s | Avg:  9m 06s | Max:  9m 12s
  🟩 12.6               Pass: 100%/38  | Total:  5h 47m | Avg:  9m 08s | Max: 26m 17s | Hits:  98%/2358  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total:  8m 31s | Avg:  4m 15s | Max:  4m 16s
  🟩 nvcc11.1           Pass: 100%/7   | Total: 42m 39s | Avg:  6m 05s | Max: 16m 00s | Hits:  98%/786   
  🟩 nvcc12.5           Pass: 100%/2   | Total: 18m 13s | Avg:  9m 06s | Max:  9m 12s
  🟩 nvcc12.6           Pass: 100%/36  | Total:  5h 39m | Avg:  9m 25s | Max: 26m 17s | Hits:  98%/2358  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total:  8m 31s | Avg:  4m 15s | Max:  4m 16s
  🟩 nvcc               Pass: 100%/45  | Total:  6h 39m | Avg:  8m 53s | Max: 26m 17s | Hits:  98%/3144  
🟩 cxx
  🟩 Clang9             Pass: 100%/4   | Total: 21m 32s | Avg:  5m 23s | Max:  6m 37s
  🟩 Clang10            Pass: 100%/1   | Total:  6m 47s | Avg:  6m 47s | Max:  6m 47s
  🟩 Clang11            Pass: 100%/1   | Total:  5m 14s | Avg:  5m 14s | Max:  5m 14s
  🟩 Clang12            Pass: 100%/1   | Total:  5m 35s | Avg:  5m 35s | Max:  5m 35s
  🟩 Clang13            Pass: 100%/1   | Total:  5m 15s | Avg:  5m 15s | Max:  5m 15s
  🟩 Clang14            Pass: 100%/1   | Total:  5m 51s | Avg:  5m 51s | Max:  5m 51s
  🟩 Clang15            Pass: 100%/1   | Total:  5m 39s | Avg:  5m 39s | Max:  5m 39s
  🟩 Clang16            Pass: 100%/1   | Total:  5m 24s | Avg:  5m 24s | Max:  5m 24s
  🟩 Clang17            Pass: 100%/1   | Total:  5m 33s | Avg:  5m 33s | Max:  5m 33s
  🟩 Clang18            Pass: 100%/7   | Total:  1h 05m | Avg:  9m 24s | Max: 24m 34s
  🟩 GCC6               Pass: 100%/2   | Total:  8m 35s | Avg:  4m 17s | Max:  4m 24s
  🟩 GCC7               Pass: 100%/2   | Total: 10m 16s | Avg:  5m 08s | Max:  5m 11s
  🟩 GCC8               Pass: 100%/1   | Total:  5m 22s | Avg:  5m 22s | Max:  5m 22s
  🟩 GCC9               Pass: 100%/3   | Total: 14m 41s | Avg:  4m 53s | Max:  5m 32s
  🟩 GCC10              Pass: 100%/1   | Total:  5m 38s | Avg:  5m 38s | Max:  5m 38s
  🟩 GCC11              Pass: 100%/1   | Total:  5m 30s | Avg:  5m 30s | Max:  5m 30s
  🟩 GCC12              Pass: 100%/3   | Total: 26m 00s | Avg:  8m 40s | Max: 16m 00s
  🟩 GCC13              Pass: 100%/8   | Total:  1h 58m | Avg: 14m 51s | Max: 26m 17s
  🟩 Intel2023.2.0      Pass: 100%/1   | Total:  7m 11s | Avg:  7m 11s | Max:  7m 11s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 16m 00s | Avg: 16m 00s | Max: 16m 00s | Hits:  98%/786   
  🟩 MSVC14.29          Pass: 100%/1   | Total: 12m 35s | Avg: 12m 35s | Max: 12m 35s | Hits:  98%/786   
  🟩 MSVC14.39          Pass: 100%/2   | Total: 26m 55s | Avg: 13m 27s | Max: 13m 28s | Hits:  98%/1572  
  🟩 NVHPC24.7          Pass: 100%/2   | Total: 18m 13s | Avg:  9m 06s | Max:  9m 12s
🟩 cxx_family
  🟩 Clang              Pass: 100%/19  | Total:  2h 12m | Avg:  6m 58s | Max: 24m 34s
  🟩 GCC                Pass: 100%/21  | Total:  3h 14m | Avg:  9m 16s | Max: 26m 17s
  🟩 Intel              Pass: 100%/1   | Total:  7m 11s | Avg:  7m 11s | Max:  7m 11s
  🟩 MSVC               Pass: 100%/4   | Total: 55m 30s | Avg: 13m 52s | Max: 16m 00s | Hits:  98%/3144  
  🟩 NVHPC              Pass: 100%/2   | Total: 18m 13s | Avg:  9m 06s | Max:  9m 12s
🟩 gpu
  🟩 h100               Pass: 100%/2   | Total: 20m 05s | Avg: 10m 02s | Max: 16m 00s
  🟩 v100               Pass: 100%/45  | Total:  6h 28m | Avg:  8m 37s | Max: 26m 17s | Hits:  98%/3144  
🟩 jobs
  🟩 Build              Pass: 100%/40  | Total:  4h 13m | Avg:  6m 20s | Max: 16m 00s | Hits:  98%/3144  
  🟩 DeviceLaunch       Pass: 100%/1   | Total: 25m 19s | Avg: 25m 19s | Max: 25m 19s
  🟩 GraphCapture       Pass: 100%/1   | Total: 21m 32s | Avg: 21m 32s | Max: 21m 32s
  🟩 HostLaunch         Pass: 100%/3   | Total: 57m 00s | Avg: 19m 00s | Max: 24m 16s
  🟩 TestGPU            Pass: 100%/2   | Total: 50m 51s | Avg: 25m 25s | Max: 26m 17s
🟩 sm
  🟩 90                 Pass: 100%/2   | Total: 20m 05s | Avg: 10m 02s | Max: 16m 00s
  🟩 90a                Pass: 100%/1   | Total:  4m 17s | Avg:  4m 17s | Max:  4m 17s
🟩 std
  🟩 11                 Pass: 100%/5   | Total: 24m 06s | Avg:  4m 49s | Max:  6m 00s
  🟩 14                 Pass: 100%/4   | Total: 31m 59s | Avg:  7m 59s | Max: 16m 00s | Hits:  98%/786   
  🟩 17                 Pass: 100%/12  | Total:  1h 25m | Avg:  7m 05s | Max: 13m 28s | Hits:  98%/1572  
  🟩 20                 Pass: 100%/26  | Total:  4h 27m | Avg: 10m 17s | Max: 26m 17s | Hits:  98%/786

🟩 thrust: Pass: 100%/46 | Total: 5h 57m | Avg: 7m 46s | Max: 21m 35s | Hits: 99%/9260

🟩 cmake_options
  🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 18m 41s | Avg:  9m 20s | Max: 13m 01s
🟩 cpu
  🟩 amd64              Pass: 100%/44  | Total:  5h 47m | Avg:  7m 53s | Max: 21m 35s | Hits:  99%/9260  
  🟩 arm64              Pass: 100%/2   | Total:  9m 58s | Avg:  4m 59s | Max:  5m 20s
🟩 ctk
  🟩 11.1               Pass: 100%/7   | Total: 44m 02s | Avg:  6m 17s | Max: 17m 59s | Hits:  99%/1852  
  🟩 12.5               Pass: 100%/2   | Total: 29m 02s | Avg: 14m 31s | Max: 15m 29s
  🟩 12.6               Pass: 100%/37  | Total:  4h 44m | Avg:  7m 40s | Max: 21m 35s | Hits:  99%/7408  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total:  9m 35s | Avg:  4m 47s | Max:  4m 50s
  🟩 nvcc11.1           Pass: 100%/7   | Total: 44m 02s | Avg:  6m 17s | Max: 17m 59s | Hits:  99%/1852  
  🟩 nvcc12.5           Pass: 100%/2   | Total: 29m 02s | Avg: 14m 31s | Max: 15m 29s
  🟩 nvcc12.6           Pass: 100%/35  | Total:  4h 34m | Avg:  7m 50s | Max: 21m 35s | Hits:  99%/7408  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total:  9m 35s | Avg:  4m 47s | Max:  4m 50s
  🟩 nvcc               Pass: 100%/44  | Total:  5h 47m | Avg:  7m 54s | Max: 21m 35s | Hits:  99%/9260  
🟩 cxx
  🟩 Clang9             Pass: 100%/4   | Total: 20m 32s | Avg:  5m 08s | Max:  6m 07s
  🟩 Clang10            Pass: 100%/1   | Total:  6m 26s | Avg:  6m 26s | Max:  6m 26s
  🟩 Clang11            Pass: 100%/1   | Total:  5m 03s | Avg:  5m 03s | Max:  5m 03s
  🟩 Clang12            Pass: 100%/1   | Total:  5m 05s | Avg:  5m 05s | Max:  5m 05s
  🟩 Clang13            Pass: 100%/1   | Total:  5m 00s | Avg:  5m 00s | Max:  5m 00s
  🟩 Clang14            Pass: 100%/1   | Total:  5m 22s | Avg:  5m 22s | Max:  5m 22s
  🟩 Clang15            Pass: 100%/1   | Total:  5m 33s | Avg:  5m 33s | Max:  5m 33s
  🟩 Clang16            Pass: 100%/1   | Total:  5m 25s | Avg:  5m 25s | Max:  5m 25s
  🟩 Clang17            Pass: 100%/1   | Total:  5m 15s | Avg:  5m 15s | Max:  5m 15s
  🟩 Clang18            Pass: 100%/7   | Total: 46m 26s | Avg:  6m 38s | Max: 13m 56s
  🟩 GCC6               Pass: 100%/2   | Total:  8m 24s | Avg:  4m 12s | Max:  4m 19s
  🟩 GCC7               Pass: 100%/2   | Total:  9m 23s | Avg:  4m 41s | Max:  4m 59s
  🟩 GCC8               Pass: 100%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
  🟩 GCC9               Pass: 100%/3   | Total: 14m 43s | Avg:  4m 54s | Max:  5m 59s
  🟩 GCC10              Pass: 100%/1   | Total:  5m 19s | Avg:  5m 19s | Max:  5m 19s
  🟩 GCC11              Pass: 100%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
  🟩 GCC12              Pass: 100%/1   | Total:  5m 33s | Avg:  5m 33s | Max:  5m 33s
  🟩 GCC13              Pass: 100%/8   | Total:  1h 06m | Avg:  8m 20s | Max: 17m 43s
  🟩 Intel2023.2.0      Pass: 100%/1   | Total:  6m 47s | Avg:  6m 47s | Max:  6m 47s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 17m 59s | Avg: 17m 59s | Max: 17m 59s | Hits:  99%/1852  
  🟩 MSVC14.29          Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s | Hits:  99%/1852  
  🟩 MSVC14.39          Pass: 100%/3   | Total: 56m 05s | Avg: 18m 41s | Max: 21m 35s | Hits:  99%/5556  
  🟩 NVHPC24.7          Pass: 100%/2   | Total: 29m 02s | Avg: 14m 31s | Max: 15m 29s
🟩 cxx_family
  🟩 Clang              Pass: 100%/19  | Total:  1h 50m | Avg:  5m 47s | Max: 13m 56s
  🟩 GCC                Pass: 100%/19  | Total:  2h 01m | Avg:  6m 24s | Max: 17m 43s
  🟩 Intel              Pass: 100%/1   | Total:  6m 47s | Avg:  6m 47s | Max:  6m 47s
  🟩 MSVC               Pass: 100%/5   | Total:  1h 29m | Avg: 17m 54s | Max: 21m 35s | Hits:  99%/9260  
  🟩 NVHPC              Pass: 100%/2   | Total: 29m 02s | Avg: 14m 31s | Max: 15m 29s
🟩 gpu
  🟩 v100               Pass: 100%/46  | Total:  5h 57m | Avg:  7m 46s | Max: 21m 35s | Hits:  99%/9260  
🟩 jobs
  🟩 Build              Pass: 100%/40  | Total:  4h 35m | Avg:  6m 52s | Max: 17m 59s | Hits:  99%/7408  
  🟩 TestCPU            Pass: 100%/3   | Total: 37m 27s | Avg: 12m 29s | Max: 21m 35s | Hits:  99%/1852  
  🟩 TestGPU            Pass: 100%/3   | Total: 44m 40s | Avg: 14m 53s | Max: 17m 43s
🟩 sm
  🟩 90a                Pass: 100%/1   | Total:  4m 18s | Avg:  4m 18s | Max:  4m 18s
🟩 std
  🟩 11                 Pass: 100%/5   | Total: 21m 56s | Avg:  4m 23s | Max:  5m 30s
  🟩 14                 Pass: 100%/4   | Total: 33m 24s | Avg:  8m 21s | Max: 17m 59s | Hits:  99%/1852  
  🟩 17                 Pass: 100%/12  | Total:  1h 39m | Avg:  8m 19s | Max: 17m 43s | Hits:  99%/3704  
  🟩 20                 Pass: 100%/23  | Total:  3h 03m | Avg:  7m 58s | Max: 21m 35s | Hits:  99%/3704

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 56s | Avg: 4m 28s | Max: 6m 59s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total:  8m 56s | Avg:  4m 28s | Max:  6m 59s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total:  8m 56s | Avg:  4m 28s | Max:  6m 59s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 56s | Avg:  4m 28s | Max:  6m 59s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total:  8m 56s | Avg:  4m 28s | Max:  6m 59s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total:  8m 56s | Avg:  4m 28s | Max:  6m 59s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total:  8m 56s | Avg:  4m 28s | Max:  6m 59s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total:  8m 56s | Avg:  4m 28s | Max:  6m 59s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  1m 57s | Avg:  1m 57s | Max:  1m 57s
  🟩 Test               Pass: 100%/1   | Total:  6m 59s | Avg:  6m 59s | Max:  6m 59s

🟩 python: Pass: 100%/1 | Total: 25m 44s | Avg: 25m 44s | Max: 25m 44s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 25m 44s | Avg: 25m 44s | Max: 25m 44s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 25m 44s | Avg: 25m 44s | Max: 25m 44s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 25m 44s | Avg: 25m 44s | Max: 25m 44s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 25m 44s | Avg: 25m 44s | Max: 25m 44s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 25m 44s | Avg: 25m 44s | Max: 25m 44s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 25m 44s | Avg: 25m 44s | Max: 25m 44s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 25m 44s | Avg: 25m 44s | Max: 25m 44s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 25m 44s | Avg: 25m 44s | Max: 25m 44s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
	Thrust
	CUDA Experimental
	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
+/-	CUB
+/-	Thrust
	CUDA Experimental
+/-	python
+/-	CCCL C Parallel Library
+/-	Catch2Helper

🏃‍ Runner counts (total jobs: 96)

#	Runner
71	`linux-amd64-cpu16`
11	`linux-amd64-gpu-v100-latest-1`
9	`windows-amd64-cpu16`
4	`linux-arm64-cpu16`
1	`linux-amd64-gpu-h100-latest-1-testing`

elstehle · 2024-12-20T14:18:32Z

Performance comparison for `old.main.i64 offset type` vs. streaming approach:

Summary for all rows

Minimum %Diff: -27.44% (i.e., 1.38x the throughput of `old.main.i64`)
Maximum %Diff: 5.64% (a 5.64% slow-down, but only for the smallest problem size, 2^16 elements)
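
For readers who want to check the summary figures against the table rows, here is a minimal sketch of the arithmetic. It assumes that `%Diff` is the relative change in time, `(Cmp Time - Ref Time) / Ref Time`, and that the throughput ratio for a fixed problem size is `Ref Time / Cmp Time`. The function names `percent_diff` and `throughput_ratio` are hypothetical helpers for illustration and are not part of the benchmark tooling.

```cpp
#include <cassert>

// Relative time difference in percent, matching the %Diff column under the
// assumption stated above. Negative values mean the compared run was faster.
double percent_diff(double ref_time, double cmp_time)
{
  assert(ref_time > 0.0);
  return (cmp_time - ref_time) / ref_time * 100.0;
}

// Throughput ratio of the compared run over the reference run. For a fixed
// number of items, throughput is items / time, so the ratio is ref / cmp.
double throughput_ratio(double ref_time, double cmp_time)
{
  assert(cmp_time > 0.0);
  return ref_time / cmp_time;
}
```

With the I8 ArgMax row at 2^28 elements (338.962 us reference, 245.951 us compared), `percent_diff` gives roughly -27.44 and `throughput_ratio` gives roughly 1.38, which is where the figures in the summary come from.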

Performance comparison for `old.main.i64` vs. streaming approach

T{ct}	Operation{ct}	Elements{io}	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I8	ArgMin	2^16	9.105 us	2.33%	9.338 us	2.53%	0.233 us	2.56%	SLOW
I8	ArgMin	2^20	12.638 us	1.44%	11.554 us	1.77%	-1.084 us	-8.58%	FAST
I8	ArgMin	2^24	41.738 us	0.52%	33.996 us	0.52%	-7.743 us	-18.55%	FAST
I8	ArgMin	2^28	339.095 us	0.43%	246.679 us	0.66%	-92.416 us	-27.25%	FAST
I8	ArgMax	2^16	9.432 us	2.06%	9.499 us	2.07%	0.067 us	0.71%	SAME
I8	ArgMax	2^20	12.650 us	1.74%	11.540 us	1.84%	-1.110 us	-8.77%	FAST
I8	ArgMax	2^24	41.745 us	0.56%	33.918 us	0.64%	-7.828 us	-18.75%	FAST
I8	ArgMax	2^28	338.962 us	0.35%	245.951 us	0.77%	-93.010 us	-27.44%	FAST
I16	ArgMin	2^16	9.347 us	2.54%	9.583 us	2.24%	0.237 us	2.53%	SLOW
I16	ArgMin	2^20	12.476 us	1.67%	11.917 us	1.57%	-0.559 us	-4.48%	FAST
I16	ArgMin	2^24	46.075 us	0.62%	38.511 us	0.68%	-7.565 us	-16.42%	FAST
I16	ArgMin	2^28	346.506 us	1.98%	310.016 us	2.11%	-36.490 us	-10.53%	FAST
I16	ArgMax	2^16	9.296 us	2.43%	9.615 us	1.68%	0.319 us	3.43%	SLOW
I16	ArgMax	2^20	12.463 us	1.67%	11.889 us	1.73%	-0.574 us	-4.61%	FAST
I16	ArgMax	2^24	45.914 us	0.58%	38.449 us	0.66%	-7.465 us	-16.26%	FAST
I16	ArgMax	2^28	345.574 us	1.85%	310.318 us	2.10%	-35.256 us	-10.20%	FAST
I32	ArgMin	2^16	8.847 us	2.19%	8.512 us	2.31%	-0.334 us	-3.78%	FAST
I32	ArgMin	2^20	12.941 us	1.59%	12.460 us	1.82%	-0.482 us	-3.72%	FAST
I32	ArgMin	2^24	59.855 us	0.40%	56.674 us	0.47%	-3.181 us	-5.31%	FAST
I32	ArgMin	2^28	579.582 us	1.33%	576.438 us	1.22%	-3.144 us	-0.54%	SAME
I32	ArgMax	2^16	8.812 us	1.55%	8.505 us	2.21%	-0.308 us	-3.49%	FAST
I32	ArgMax	2^20	13.020 us	1.67%	12.493 us	1.80%	-0.528 us	-4.05%	FAST
I32	ArgMax	2^24	59.830 us	0.44%	56.633 us	0.49%	-3.196 us	-5.34%	FAST
I32	ArgMax	2^28	579.888 us	1.32%	576.493 us	1.23%	-3.395 us	-0.59%	SAME
I64	ArgMin	2^16	8.701 us	1.73%	9.121 us	2.49%	0.420 us	4.83%	SLOW
I64	ArgMin	2^20	14.708 us	1.13%	15.204 us	1.31%	0.497 us	3.38%	SLOW
I64	ArgMin	2^24	93.593 us	0.27%	94.248 us	0.32%	0.655 us	0.70%	SLOW
I64	ArgMin	2^28	1.120 ms	0.87%	1.116 ms	0.81%	-4.249 us	-0.38%	SAME
I64	ArgMax	2^16	8.598 us	1.59%	9.083 us	2.45%	0.485 us	5.64%	SLOW
I64	ArgMax	2^20	14.839 us	1.29%	15.271 us	1.39%	0.431 us	2.91%	SLOW
I64	ArgMax	2^24	93.671 us	0.31%	94.310 us	0.32%	0.639 us	0.68%	SLOW
I64	ArgMax	2^28	1.120 ms	0.88%	1.116 ms	0.82%	-4.115 us	-0.37%	SAME
I128	ArgMin	2^16	10.218 us	1.89%	10.497 us	1.56%	0.279 us	2.73%	SLOW
I128	ArgMin	2^20	24.396 us	0.85%	24.242 us	0.92%	-0.153 us	-0.63%	SAME
I128	ArgMin	2^24	165.965 us	0.91%	167.414 us	0.71%	1.449 us	0.87%	SLOW
I128	ArgMin	2^28	2.214 ms	0.55%	2.206 ms	0.50%	-7.729 us	-0.35%	SAME
I128	ArgMax	2^16	10.515 us	1.83%	10.305 us	2.07%	-0.211 us	-2.01%	FAST
I128	ArgMax	2^20	24.632 us	0.94%	23.853 us	0.87%	-0.780 us	-3.16%	FAST
I128	ArgMax	2^24	166.315 us	0.93%	167.585 us	0.59%	1.270 us	0.76%	SLOW
I128	ArgMax	2^28	2.214 ms	0.54%	2.206 ms	0.50%	-7.473 us	-0.34%	SAME
F32	ArgMin	2^16	9.295 us	2.57%	8.222 us	2.36%	-1.073 us	-11.55%	FAST
F32	ArgMin	2^20	13.201 us	1.64%	12.146 us	1.35%	-1.055 us	-7.99%	FAST
F32	ArgMin	2^24	60.171 us	0.45%	56.509 us	0.44%	-3.662 us	-6.09%	FAST
F32	ArgMin	2^28	580.091 us	1.33%	576.227 us	1.23%	-3.864 us	-0.67%	SAME
F32	ArgMax	2^16	9.236 us	2.18%	8.304 us	2.41%	-0.932 us	-10.09%	FAST
F32	ArgMax	2^20	13.253 us	1.54%	12.098 us	1.74%	-1.155 us	-8.72%	FAST
F32	ArgMax	2^24	60.159 us	0.47%	56.480 us	0.43%	-3.679 us	-6.12%	FAST
F32	ArgMax	2^28	580.086 us	1.34%	576.152 us	1.23%	-3.934 us	-0.68%	SAME
F64	ArgMin	2^16	9.080 us	2.17%	9.179 us	2.86%	0.098 us	1.08%	SAME
F64	ArgMin	2^20	15.096 us	1.46%	15.044 us	1.42%	-0.052 us	-0.34%	SAME
F64	ArgMin	2^24	93.488 us	0.31%	94.285 us	0.30%	0.798 us	0.85%	SLOW
F64	ArgMin	2^28	1.116 ms	0.86%	1.116 ms	0.81%	-0.344 us	-0.03%	SAME
F64	ArgMax	2^16	9.128 us	2.39%	8.986 us	2.20%	-0.142 us	-1.56%	SAME
F64	ArgMax	2^20	15.059 us	1.62%	15.048 us	1.25%	-0.011 us	-0.07%	SAME
F64	ArgMax	2^24	93.443 us	0.30%	94.263 us	0.29%	0.819 us	0.88%	SLOW
F64	ArgMax	2^28	1.117 ms	0.85%	1.116 ms	0.80%	-0.703 us	-0.06%	SAME

Performance comparison for `old.main.i32` vs. streaming approach

Summary for rows where Elements{io} is 2^28

Minimum %Diff: -0.03%
Maximum %Diff: 2.51%

Performance comparison for `old.main.i32` vs. streaming approach

T{ct}	Operation{ct}	Elements{io}	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I8	ArgMin	2^16	9.118 us	1.97%	9.338 us	2.53%	0.221 us	2.42%	SLOW
I8	ArgMin	2^20	11.384 us	1.49%	11.554 us	1.77%	0.170 us	1.49%	SAME
I8	ArgMin	2^24	33.450 us	0.63%	33.996 us	0.52%	0.546 us	1.63%	SLOW
I8	ArgMin	2^28	240.632 us	0.79%	246.679 us	0.66%	6.046 us	2.51%	SLOW
I8	ArgMax	2^16	9.424 us	2.57%	9.499 us	2.07%	0.075 us	0.80%	SAME
I8	ArgMax	2^20	11.358 us	1.69%	11.540 us	1.84%	0.182 us	1.61%	SAME
I8	ArgMax	2^24	33.498 us	0.67%	33.918 us	0.64%	0.419 us	1.25%	SLOW
I8	ArgMax	2^28	243.687 us	0.79%	245.951 us	0.77%	2.265 us	0.93%	SLOW
I16	ArgMin	2^16	9.495 us	2.63%	9.583 us	2.24%	0.088 us	0.93%	SAME
I16	ArgMin	2^20	11.671 us	1.42%	11.917 us	1.57%	0.246 us	2.11%	SLOW
I16	ArgMin	2^24	38.096 us	0.75%	38.511 us	0.68%	0.414 us	1.09%	SLOW
I16	ArgMin	2^28	309.922 us	2.11%	310.016 us	2.11%	0.094 us	0.03%	SAME
I16	ArgMax	2^16	9.346 us	2.64%	9.615 us	1.68%	0.269 us	2.87%	SLOW
I16	ArgMax	2^20	11.692 us	1.88%	11.889 us	1.73%	0.197 us	1.69%	SAME
I16	ArgMax	2^24	38.072 us	0.75%	38.449 us	0.66%	0.378 us	0.99%	SLOW
I16	ArgMax	2^28	309.651 us	2.13%	310.318 us	2.10%	0.667 us	0.22%	SAME
I32	ArgMin	2^16	8.017 us	1.77%	8.512 us	2.31%	0.495 us	6.18%	SLOW
I32	ArgMin	2^20	12.078 us	1.82%	12.460 us	1.82%	0.382 us	3.16%	SLOW
I32	ArgMin	2^24	56.300 us	0.44%	56.674 us	0.47%	0.374 us	0.67%	SLOW
I32	ArgMin	2^28	576.020 us	1.23%	576.438 us	1.22%	0.418 us	0.07%	SAME
I32	ArgMax	2^16	8.050 us	1.63%	8.505 us	2.21%	0.455 us	5.65%	SLOW
I32	ArgMax	2^20	12.016 us	1.23%	12.493 us	1.80%	0.476 us	3.96%	SLOW
I32	ArgMax	2^24	56.175 us	0.45%	56.633 us	0.49%	0.459 us	0.82%	SLOW
I32	ArgMax	2^28	576.168 us	1.22%	576.493 us	1.23%	0.324 us	0.06%	SAME
I64	ArgMin	2^16	8.708 us	2.19%	9.121 us	2.49%	0.414 us	4.75%	SLOW
I64	ArgMin	2^20	14.878 us	1.36%	15.204 us	1.31%	0.326 us	2.19%	SLOW
I64	ArgMin	2^24	93.945 us	0.30%	94.248 us	0.32%	0.303 us	0.32%	SLOW
I64	ArgMin	2^28	1.116 ms	0.82%	1.116 ms	0.81%	-0.039 us	-0.00%	SAME
I64	ArgMax	2^16	8.743 us	2.07%	9.083 us	2.45%	0.339 us	3.88%	SLOW
I64	ArgMax	2^20	14.939 us	1.40%	15.271 us	1.39%	0.332 us	2.22%	SLOW
I64	ArgMax	2^24	93.908 us	0.31%	94.310 us	0.32%	0.402 us	0.43%	SLOW
I64	ArgMax	2^28	1.116 ms	0.81%	1.116 ms	0.82%	0.088 us	0.01%	SAME
I128	ArgMin	2^16	10.168 us	1.77%	10.497 us	1.56%	0.329 us	3.23%	SLOW
I128	ArgMin	2^20	23.805 us	0.82%	24.242 us	0.92%	0.438 us	1.84%	SLOW
I128	ArgMin	2^24	167.297 us	0.51%	167.414 us	0.71%	0.117 us	0.07%	SAME
I128	ArgMin	2^28	2.206 ms	0.50%	2.206 ms	0.50%	0.288 us	0.01%	SAME
I128	ArgMax	2^16	10.461 us	1.78%	10.305 us	2.07%	-0.156 us	-1.50%	SAME
I128	ArgMax	2^20	24.045 us	0.99%	23.853 us	0.87%	-0.193 us	-0.80%	SAME
I128	ArgMax	2^24	167.588 us	0.54%	167.585 us	0.59%	-0.003 us	-0.00%	SAME
I128	ArgMax	2^28	2.206 ms	0.50%	2.206 ms	0.50%	0.573 us	0.03%	SAME
F32	ArgMin	2^16	8.489 us	2.06%	8.222 us	2.36%	-0.267 us	-3.15%	FAST
F32	ArgMin	2^20	12.315 us	1.71%	12.146 us	1.35%	-0.169 us	-1.37%	FAST
F32	ArgMin	2^24	56.654 us	0.45%	56.509 us	0.44%	-0.145 us	-0.26%	SAME
F32	ArgMin	2^28	576.334 us	1.21%	576.227 us	1.23%	-0.106 us	-0.02%	SAME
F32	ArgMax	2^16	8.492 us	2.31%	8.304 us	2.41%	-0.188 us	-2.21%	SAME
F32	ArgMax	2^20	12.281 us	1.78%	12.098 us	1.74%	-0.184 us	-1.50%	SAME
F32	ArgMax	2^24	56.601 us	0.47%	56.480 us	0.43%	-0.122 us	-0.21%	SAME
F32	ArgMax	2^28	576.297 us	1.23%	576.152 us	1.23%	-0.145 us	-0.03%	SAME
F64	ArgMin	2^16	9.214 us	1.91%	9.179 us	2.86%	-0.036 us	-0.39%	SAME
F64	ArgMin	2^20	15.252 us	1.51%	15.044 us	1.42%	-0.208 us	-1.37%	SAME
F64	ArgMin	2^24	94.372 us	0.32%	94.285 us	0.30%	-0.086 us	-0.09%	SAME
F64	ArgMin	2^28	1.116 ms	0.81%	1.116 ms	0.81%	0.000 us	0.00%	SAME
F64	ArgMax	2^16	9.181 us	2.28%	8.986 us	2.20%	-0.195 us	-2.13%	SAME
F64	ArgMax	2^20	15.272 us	1.45%	15.048 us	1.25%	-0.225 us	-1.47%	FAST
F64	ArgMax	2^24	94.383 us	0.30%	94.263 us	0.29%	-0.120 us	-0.13%	SAME
F64	ArgMax	2^28	1.116 ms	0.80%	1.116 ms	0.80%	-0.281 us	-0.03%	SAME
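
For context on what "streaming approach" refers to in these tables, the following is a host-side sketch of the partitioning idea from the PR description: reduce each partition with a narrow local index, and widen to the global offset only when combining a partition's result with the running result. It is a sequential model under stated assumptions, not the CUB implementation. The name `streaming_arg_min` and the `max_partition_size` parameter are hypothetical; the parameter stands in for the `INT_MAX` partition limit so the logic can be exercised on small inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-side model of a streaming arg-min (not CUB code).
// Returns {global index, value} of the first minimum in `in`.
template <typename T>
std::pair<std::int64_t, T> streaming_arg_min(const std::vector<T>& in, std::int32_t max_partition_size)
{
  assert(!in.empty() && max_partition_size > 0);
  const auto num_items = static_cast<std::int64_t>(in.size());

  std::int64_t best_index = 0;
  T best_value            = in[0];

  for (std::int64_t partition_begin = 0; partition_begin < num_items; partition_begin += max_partition_size)
  {
    const auto partition_size =
      static_cast<std::int32_t>(std::min<std::int64_t>(max_partition_size, num_items - partition_begin));

    // Per-partition reduction: only a 32-bit local index is carried here.
    std::int32_t local_index = 0;
    for (std::int32_t i = 1; i < partition_size; ++i)
    {
      if (in[partition_begin + i] < in[partition_begin + local_index])
      {
        local_index = i;
      }
    }

    // Partition boundary: widen to the 64-bit global index once and compare
    // against the running result. Strict '<' keeps the earliest index on ties.
    const T local_value = in[partition_begin + local_index];
    if (partition_begin == 0 || local_value < best_value)
    {
      best_index = partition_begin + local_index;
      best_value = local_value;
    }
  }
  return {best_index, best_value};
}
```

Calling `streaming_arg_min(std::vector<int>{5, 3, 9, 1, 7, 1, 4}, 3)` walks three partitions and returns index 3 with value 1, the same result as a single pass over the whole input.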

…NVIDIA#2647)

* adds benchmarks for reduce::arg{min,max}
* preliminary streaming arg-extremum reduction
* fixes implicit conversion
* uses streaming dispatch class
* changes arg benches to use new streaming reduce
* streaming arg-extrema reduction
* fixes style
* fixes compilation failures
* cleanups
* adds rst style comments
* declare vars const and use clamp
* consolidates argmin argmax benchmarks
* fixes thrust usage
* drops offset type in arg-extrema benchmarks
* fixes clang cuda
* exec space macros
* switch to signed global offset type for slightly better perf
* clarifies documentation
* applies minor benchmark style changes from review comments
* fixes interface documentation and comments
* list-init accumulating output op
* improves style, comments, and tests
* cleans up aggregate init
* renames dispatch class usage in benchmarks
* fixes merge conflicts
* addresses review comments
* addresses review comments
* fixes assertion
* removes superseded implementation
* changes large problem tests to use new interface
* removes obsolete tests for deprecated interface

implement `add_sat` split `signed`/`unsigned` implementation, improve implementation for MSVC improve device `add_sat` implementation add `add_sat` test improve generic `add_sat` implementation for signed types implement `sub_sat` allow more msvc intrinsics on x86 add op tests partially implement `mul_sat` implement `div_sat` and `saturate_cast` add `saturate_cast` test simplify `div_sat` test Deprectate C++11 and C++14 for libcu++ (#3173) * Deprectate C++11 and C++14 for libcu++ Co-authored-by: Bernhard Manfred Gruber <[email protected]> Implement `abs` and `div` from `cstdlib` (#3153) * implement integer abs functions * improve tests, fix constexpr support * just use the our implementation * implement `cuda::std::div` * prefer host's `div_t` like types * provide `cuda::std::abs` overloads for floats * allow fp abs for NVRTC * silence msvc's warning about conversion from floating point to integral Fix missing radix sort policies (#3174) Fixes NVBug 5009941 Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148) * introduces new arg{min,max} interface with two output iterators * adds fp inf tests * fixes docs * improves code example * fixes exec space specifier * trying to fix deprecation warning for more compilers * inlines unzip operator * trying to fix deprecation warning for nvhpc * integrates supression fixes in diagnostics * pre-ctk 11.5 deprecation suppression * fixes icc * fix for pre-ctk11.5 * cleans up deprecation suppression * cleanup Extend tuning documentation (#3179) Add codespell pre-commit hook, fix typos in CCCL (#3168) * Add codespell pre-commit hook * Automatic changes from codespell. * Manual changes. 
Fix parameter space for TUNE_LOAD in scan benchmark (#3176) fix various old compiler checks (#3178) implement C++26 `std::projected` (#3175) Fix pre-commit config for codespell and remaining typos (#3182) Massive cleanup of our config (#3155) Fix UB in atomics with automatic storage (#2586) * Adds specialized local cuda atomics and injects them into most atomics paths. Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> * Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478 * Remove extraneous double brackets in unformatted code. * Merge unsafe atomic logic into `__cuda_is_local`. * Use `const_cast` for type conversions in cuda_local.h * Fix build issues from interface changes * Fix missing __nanosleep on sm70- * Guard __isLocal from NVHPC * Use PTX instead of running nothing from NVHPC * fixup /s/nvrtc/nvhpc * Fixup missing CUDA ifdef surrounding device code * Fix codegen * Bypass some sort of compiler bug on GCC7 * Apply suggestions from code review * Use unsafe automatic storage atomics in codegen tests --------- Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> Co-authored-by: Michael Schellenberger Costa <[email protected]> Refactor the source code layout for `cuda.parallel` (#3177) * Refactor the source layout for cuda.parallel * Add copyright * Address review feedback * Don't import anything into `experimental` namespace * fix import --------- Co-authored-by: Ashwin Srinath <[email protected]> new type-erased memory resources (#2824) s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186) Document address stability of `thrust::transform` (#3181) * Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS * Reformat and fix UnaryFunction/BinaryFunction in transform docs * Mention transform can use proclaim_copyable_arguments * Document cuda::proclaims_copyable_arguments better * Deprecate depending on transform functor argument 
addresses Fixes: #3053 turn off cuda version check for clangd (#3194) [STF] jacobi example based on parallel_for (#3187) * Simple jacobi example with parallel for and reductions * clang-format * remove useless capture list fixes pre-nv_diag suppression issues (#3189) Prefer c2h::type_name over c2h::demangle (#3195) Fix memcpy_async* tests (#3197) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. (Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test Add type annotations and mypy checks for `cuda.parallel` (#3180) * Refactor the source layout for cuda.parallel * Add initial type annotations * Update pre-commit config * More typing * Fix bad merge * Fix TYPE_CHECKING and numpy annotations * typing bindings.py correctly * Address review feedback --------- Co-authored-by: Ashwin Srinath <[email protected]> Fix rendering of cuda.parallel docs (#3192) * Fix pre-commit config for codespell and remaining typos * Fix rendering of docs for cuda.parallel --------- Co-authored-by: Ashwin Srinath <[email protected]> Enable PDL for DeviceMergeSortBlockSortKernel (#3199) The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC. This commit enables PDL when launching the kernel. 
Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647) * adds benchmarks for reduce::arg{min,max} * preliminary streaming arg-extremum reduction * fixes implicit conversion * uses streaming dispatch class * changes arg benches to use new streaming reduce * streaming arg-extrema reduction * fixes style * fixes compilation failures * cleanups * adds rst style comments * declare vars const and use clamp * consolidates argmin argmax benchmarks * fixes thrust usage * drops offset type in arg-extrema benchmarks * fixes clang cuda * exec space macros * switch to signed global offset type for slightly better perf * clarifies documentation * applies minor benchmark style changes from review comments * fixes interface documentation and comments * list-init accumulating output op * improves style, comments, and tests * cleans up aggregate init * renames dispatch class usage in benchmarks * fixes merge conflicts * addresses review comments * addresses review comments * fixes assertion * removes superseded implementation * changes large problem tests to use new interface * removes obsolete tests for deprecated interface Fixes for Python 3.7 docs environment (#3206) Co-authored-by: Ashwin Srinath <[email protected]> Adds support for large number of items to `DeviceTransform` (#3172) * moves large problem test helper to common file * adds support for large num items to device transform * adds tests for large number of items to device interface * fixes format * addresses review comments cp_async_bulk: Fix test (#3198) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. 
(Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test * cp_async_bulk: Fix test The global memory pointer could be misaligned. cudax fixes for msvc 14.41 (#3200) avoid instantiating class templates in `is_same` implementation when possible (#3203) Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209) * Fix: make launchers a CUB detail; make kernel source functions hidden. * [pre-commit.ci] auto code formatting * Address review comments, fix which macro gets fixed. help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202) unify macros and cmake options that control the suppression of deprecation warnings (#3220) * unify macros and cmake options that control the suppression of deprecation warnings * suppress nvcc warning #186 in thrust header tests * suppress c++ dialect deprecation warnings in libcudacxx header tests Fx thread-reduce performance regression (#3225) cuda.parallel: In-memory caching of build objects (#3216) * Define __eq__ and __hash__ for Iterators * Define cache_with_key utility and use it to cache Reduce objects * Add tests for caching Reduce objects * Tighten up types * Updates to support 3.7 * Address review feedback * Introduce IteratorKind to hold iterator type information * Use the .kind to generate an abi_name * Remove __eq__ and __hash__ methods from IteratorBase * Move helper function * Formatting * Don't unpack tuple in cache key --------- Co-authored-by: Ashwin Srinath <[email protected]> Just enough ranges for c++14 `span` (#3211) use generalized concepts portability macros to simplify the `range` concept (#3217) fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR` Use Ruff to sort imports (#3230) * Update pyproject.tomls for import sorting * Update files after running pre-commit * Move ruff config to pyproject.toml --------- Co-authored-by: Ashwin Srinath <[email protected]> fix 
tuning_scan sm90 config issue (#3236) Co-authored-by: Shijie Chen <[email protected]> [STF] Logical token (#3196) * Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs. * Add missing files * Check if a task implementation can match a prototype where the void_interface arguments are ignored * Implement ctx.abstract_logical_data() which relies on a void data interface * Illustrate how to use abstract handles in local contexts * Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages * Small improvements in the examples * Do not try to allocate or move void data * Do not use I as a variable * fix linkage error * rename abtract_logical_data into logical_token * Document logical token * fix spelling error * fix sphinx error * reflect name changes * use meaningful variable names * simplify logical_token implementation because writeback is already disabled * add a unit test for token elision * implement token elision in host_launch * Remove unused type * Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens * Much simpler is_tuple_invocable_with_filtered implementation * Fix buggy test * Factorize code * Document that we can ignore tokens for task and host_launch * Documentation for logical data freeze Fix ReduceByKey tuning (#3240) Fix RLE tuning (#3239) cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233) * Forbid non-contiguous arrays as inputs (or outputs) * Implement a more robust way to check for contiguity * Don't bother if cublas unavailable * Fix how we check for zero-element arrays * sort imports --------- Co-authored-by: Ashwin Srinath <[email protected]> expands support for more offset types in segmented benchmark (#3231) Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253) * Add 
escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects * Do not add option twice ptx: Add add_instruction.py (#3190) This file helps create the necessary structure for new PTX instructions. Co-authored-by: Allard Hendriksen <[email protected]> Bump main to 2.9.0. (#3247) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop cub::Mutex (#3251) Fixes: #3250 Remove legacy macros from CUB util_arch.cuh (#3257) Fixes: #3256 Remove thrust::[unary|binary]_traits (#3260) Fixes: #3259 Architecture and OS identification macros (#3237) Bump main to 3.0.0. (#3265) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop thrust not1 and not2 (#3264) Fixes: #3263 CCCL Internal macro documentation (#3238) Deprecate GridBarrier and GridBarrierLifetime (#3258) Fixes: #1389 Require at least gcc7 (#3268) Fixes: #3267 Drop thrust::[unary|binary]_function (#3274) Fixes: #3273 Drop ICC from CI (#3277) [STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270) * Add a test to reproduce a bug observed with parallel_for on a host place * clang-format * use _CCCL_ASSERT * Attempt to debug * do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead * fix lambda expression * clang-format Enable thrust::identity test for non-MSVC (#3281) This seems to be an oversight when the test was added Co-authored-by: Michael Schellenberger Costa <[email protected]> Enable PDL in triple chevron launch (#3282) It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed to _CCCL_HAS_PDL during the review introducing the feature. 
Disambiguate line continuations and macro continuations in <nv/target> (#3244) Drop VS 2017 from CI (#3287) Fixes: #3286 Drop ICC support in code (#3279) * Drop ICC from code Fixes: #3278 Co-authored-by: Michael Schellenberger Costa <[email protected]> Make CUB NVRTC commandline arguments come from a cmake template (#3292) Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template than in the feature request template (#3295) Use process isolation instead of default hyper-v for Windows. (#3294) Try improving build times by using process isolation instead of hyper-v Co-authored-by: Michael Schellenberger Costa <[email protected]> [pre-commit.ci] pre-commit autoupdate (#3248) * [pre-commit.ci] pre-commit autoupdate updates: - [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6) - [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6) - [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1) Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop Thrust legacy arch macros (#3298) Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS Drop Thrust's compiler_fence.h (#3300) Drop CTK 11.x from CI (#3275) * Add cuda12.0-gcc7 devcontainer * Move MSVC2017 jobs to CTK 12.6 Those is the only combination where rapidsai has devcontainers * Add /Zc:__cplusplus for the libcudacxx tests * Only add excape hatch for affected CTKs * Workaround missing cudaLaunchKernelEx on MSVC cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSV_VER. 
As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK. * Workaround nvcc+MSVC issue * Regenerate devcontainers Fixes: #3249 Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop CUB's util_compiler.cuh (#3302) All contained macros were deprecated Update packman and repo_docs versions (#3293) Co-authored-by: Ashwin Srinath <[email protected]> Drop Thrust's deprecated compiler macros (#3301) Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305) Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506) * adds support for large number of items to three-way partition * adapts interface to use choose_signed_offset_t * integrates applicable feedback from device-select pr * changes behavior for empty problems * unifies grid constant macro * fixes kernel template specialization mismatch * integrates _CCCL_GRID_CONSTANT changes * resolve merge conflicts * fixes checks in test * fixes test verification * improves tests * makes few improvements to streaming dispatch * improves code comment on test * fixes unrelated compiler error * minor style improvements Refactor scan tunings (#3262) Require C++17 for compiling Thrust and CUB (#3255) * Issue an unsuppressable warning when compiling with < C++17 * Remove C++11/14 presets * Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers * Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14] * Remove CUB_ENABLE_DIALECT_CPP[11|14] * Update CI runs * Remove C++11/14 CI runs for CUB and Thrust * Raise compiler minimum versions for C++17 * Update ReadMe * Drop Thrust's cpp14_required.h * Add escape hatch for C++17 removal Fixes: #3252 Implement `views::empty` (#3254) * Disable pair conversion of subrange with clang in C++17 * Fix namespace views * Implement `views::empty` This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view Refactor `limits` and `climits` (#3221) * implement 
builtins for huge val, nan and nans * change `INFINITY` and `NAN` implementation for NVRTC cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311) * Add tests demonstrating usage of different iterators * Update documentation of reduce_into by merging import code snippet with the rest of the example * Add documentation for current iterators * Run pre-commit checks and update accordingly * Fix comments to refer to the proper lines in the code snippets in the docs Drop clang<14 from CI, update devcontainers. (#3309) Co-authored-by: Bernhard Manfred Gruber <[email protected]> [STF] Cleanup task dependencies object constructors (#3291) * Define tag types for access modes * - Rework how we build task_dep objects based on access mode tags - pack_state is now responsible for using a const_cast for read only data * Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums * It seems the const_cast was not necessarily so we can simplify it and not even do some dispatch based on access modes Disable test with a gcc-14 regression (#3297) Deprecate Thrust's cpp_compatibility.h macros (#3299) Remove dropped function objects from docs (#3319) Document `NV_TARGET` macros (#3313) [STF] Define ctx.pick_stream() which was missing for the unified context (#3326) * Define ctx.pick_stream() which was missing for the unified context * clang-format Deprecate cub::IterateThreadStore (#3337) Drop CUB's BinaryFlip operator (#3332) Deprecate cub::Swap (#3333) Clarify transform output can overlap input (#3323) Drop CUB APIs with a debug_synchronous parameter (#3330) Fixes: #3329 Drop CUB's util_compiler.cuh for real (#3340) PR #3302 planned to drop the file, but only dropped its content. This was an oversight. So let's drop the entire file. 
Drop cub::ValueCache (#3346) limits offset types for merge sort (#3328) Drop CDPv1 (#3344) Fixes: #3341 Drop thrust::void_t (#3362) Use cuda::std::addressof in Thrust (#3363) Fix all_of documentation for empty ranges (#3358) all_of always returns true on an empty range. [STF] Do not keep track of dangling events in a CUDA graph backend (#3327) * Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when the CUDA graph completes. Therefore keeping track of "dangling events" is a waste of time and resources. * replace can_ignore_dangling_events by track_dangling_events which leads to more readable code * When not storing the dangling events, we must still perform the deinit operations that were producing these events ! Extract scan kernels into NVRTC-compilable header (#3334) * Extract scan kernels into NVRTC-compilable header * Update cub/cub/device/dispatch/dispatch_scan.cuh Co-authored-by: Georgii Evtushenko <[email protected]> --------- Co-authored-by: Ashwin Srinath <[email protected]> Co-authored-by: Georgii Evtushenko <[email protected]> Drop deprecated aliases in Thrust functional (#3272) Fixes: #3271 Drop cub::DivideAndRoundUp (#3347) Use cuda::std::min/max in Thrust (#3364) Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361) * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` Cleanup util_arch (#2773) Deprecate thrust::null_type (#3367) Deprecate cub::DeviceSpmv (#3320) Fixes: #896 Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Compile basic infra test with C++17 (#3377) Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308) * fixes segment 
offset generation
* switches to analytical verification
* switches to analytical verification for pairs
* addresses review comments
* introduces segment offset type
* adds tests for large number of segments
* adds support for large number of segments
* drops segment offset type
* fixes thrust namespace
* removes about-to-be-deprecated cub iterators
* no exec specifier on defaulted ctor
* fixes gcc7 linker error
* uses local_segment_index_t throughout
* determine offset type based on type returned by segment iterator begin/end iterators
* minor style improvements

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing
* Enable `reduce` to accept arrays of structs as inputs
* Add test for reducing arrays-of-struct
* Update documentation
* Use a numpy array rather than ctypes object
* Change zeros -> empty for output array and temp storage
* Add a TODO for typing GpuStruct
* Documentation updates
* Remove test_reduce_struct_type from test_reduce.py
* Revert to `to_cccl_value()` accepting ndarray + GpuStruct
* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCC_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306

Upgrade to Catch2 3.8 (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Update CODEOWNERS (#3331)

* Update CODEOWNERS
* Update CODEOWNERS
* Update CODEOWNERS
* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable
on host and device (#3382)

* Implement more cmath functions to be usable on host and device
* Implement math roots functions
* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run, so _CCCL_ASSERT within a constexpr function breaks it. Fix this by always using the host definition, which should also work on device.

Fixes #3411

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3
* Empty commit

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative
* Run `copy_cccl_headers_to_aude_include()` before `setup()`
* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.
* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel
* Bug fix: cuda/_include only exists after shutil.copytree() ran.
* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py
* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)
* Replace := operator (needs Python 3.8+)
* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md
* Restore original README.md: `pip3 install -e` now works on first pass.
* cuda_cccl/README.md: FOR INTERNAL USE ONLY * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917) Command used: ci/update_version.sh 2 8 0 * Modernize pyproject.toml, setup.py Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178 * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996 * Install CCCL headers under cuda.cccl.include Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562 Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely. * Factor out cuda_cccl/cuda/cccl/include_paths.py * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative * Add missing Copyright notice. * Add missing __init__.py (cuda.cccl) * Add `"cuda.cccl"` to `autodoc.mock_imports` * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.) * Add # TODO: move this to a module-level import * Modernize cuda_cooperative/pyproject.toml, setup.py * Convert cuda_cooperative to use hatchling as build backend. * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe. * Move numpy from [build-system] requires -> [project] dependencies * Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH * Remove copy_license() and use license_files=["../../LICENSE"] instead. 
* Further modernize cuda_cccl/setup.py to use pathlib * Trivial simplifications in cuda_cccl/pyproject.toml * Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code * Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml * Add taplo-pre-commit to .pre-commit-config.yaml * taplo-pre-commit auto-fixes * Use pathlib in cuda_cooperative/setup.py * CCCL_PYTHON_PATH in cuda_cooperative/setup.py * Modernize cuda_parallel/pyproject.toml, setup.py * Use pathlib in cuda_parallel/setup.py * Add `# TOML lint & format` comment. * Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml * Use pathlib in cuda/cccl/include_paths.py * pre-commit autoupdate (EXCEPT clang-format, which was manually restored) * Fixes after git merge main * Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result' ``` =========================================================================== warnings summary =========================================================================== tests/test_reduce.py::test_reduce_non_contiguous /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080> Traceback (most recent call last): File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__ bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result)) ^^^^^^^^^^^^^^^^^ AttributeError: '_Reduce' object has no attribute 'build_result' warnings.warn(pytest.PytestUnraisableExceptionWarning(msg)) -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ============================================================== ``` * Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class 
CustomBuildPy` * Introduce cuda_cooperative/constraints.txt * Also add cuda_parallel/constraints.txt * Add `--constraint constraints.txt` in ci/test_python.sh * Update Copyright dates * Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024) For completeness: The other repo took a long time to install into the pre-commit cache; so long it lead to timeouts in the CCCL CI. * Remove unused cuda_parallel jinja2 dependency (noticed by chance). * Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead. * Make cuda_cooperative, cuda_parallel testing completely independent. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]" This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2. Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971 * Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Restore original ci/matrix.yaml [skip-rapids] * Use for loop in test_python.sh to avoid code duplication. 
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]
* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]" This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.
* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)
* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()
* Add tests to check for reduce_into() stream behavior
* Move protocol related utils to separate file and rework __cuda_stream__ error messages
* Fix synchronization issue in stream test and add one more invalid stream test case
* Rename cuda stream validation function after removing leading underscore
* Unpack values from __cuda_stream__ instead of indexing
* Fix linting errors
* Handle TypeError when unpacking invalid __cuda_stream__ return
* Use stream to allocate cupy memory in new stream test

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404

move to c++17, finalize device optimization

fix msvc compilation, update tests

Deprecate C++11 and C++14 for libcu++ (#3173)

* Deprecate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from
floating point to integral

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators
* adds fp inf tests
* fixes docs
* improves code example
* fixes exec space specifier
* trying to fix deprecation warning for more compilers
* inlines unzip operator
* trying to fix deprecation warning for nvhpc
* integrates suppression fixes in diagnostics
* pre-ctk 11.5 deprecation suppression
* fixes icc
* fix for pre-ctk11.5
* cleans up deprecation suppression
* cleanup

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

  Co-authored-by: Georgy Evtushenko <[email protected]>
  Co-authored-by: gonzalobg <[email protected]>
* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478
* Remove extraneous double brackets in unformatted code.
* Merge unsafe atomic logic into `__cuda_is_local`.
* Use `const_cast` for type conversions in cuda_local.h * Fix build issues from interface changes * Fix missing __nanosleep on sm70- * Guard __isLocal from NVHPC * Use PTX instead of running nothing from NVHPC * fixup /s/nvrtc/nvhpc * Fixup missing CUDA ifdef surrounding device code * Fix codegen * Bypass some sort of compiler bug on GCC7 * Apply suggestions from code review * Use unsafe automatic storage atomics in codegen tests --------- Co-authored-by: Georgy Evtushenko <[email protected]> Co-authored-by: gonzalobg <[email protected]> Co-authored-by: Michael Schellenberger Costa <[email protected]> Refactor the source code layout for `cuda.parallel` (#3177) * Refactor the source layout for cuda.parallel * Add copyright * Address review feedback * Don't import anything into `experimental` namespace * fix import --------- Co-authored-by: Ashwin Srinath <[email protected]> new type-erased memory resources (#2824) s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186) Document address stability of `thrust::transform` (#3181) * Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS * Reformat and fix UnaryFunction/BinaryFunction in transform docs * Mention transform can use proclaim_copyable_arguments * Document cuda::proclaims_copyable_arguments better * Deprecate depending on transform functor argument addresses Fixes: #3053 turn off cuda version check for clangd (#3194) [STF] jacobi example based on parallel_for (#3187) * Simple jacobi example with parallel for and reductions * clang-format * remove useless capture list fixes pre-nv_diag suppression issues (#3189) Prefer c2h::type_name over c2h::demangle (#3195) Fix memcpy_async* tests (#3197) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. 
The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. (Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test Add type annotations and mypy checks for `cuda.parallel` (#3180) * Refactor the source layout for cuda.parallel * Add initial type annotations * Update pre-commit config * More typing * Fix bad merge * Fix TYPE_CHECKING and numpy annotations * typing bindings.py correctly * Address review feedback --------- Co-authored-by: Ashwin Srinath <[email protected]> Fix rendering of cuda.parallel docs (#3192) * Fix pre-commit config for codespell and remaining typos * Fix rendering of docs for cuda.parallel --------- Co-authored-by: Ashwin Srinath <[email protected]> Enable PDL for DeviceMergeSortBlockSortKernel (#3199) The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC. This commit enables PDL when launching the kernel. 
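The large-`num_items` support for `DeviceReduce::{ArgMin,ArgMax}` (#2647) rests on a streaming idea: the input is split into partitions of at most `INT_MAX` items, the arg-extremum of each partition is computed with a narrow offset, and only the running global result carries a wide index. A minimal host-side C++ sketch of that idea follows. It is not CUB's implementation; `streaming_arg_min`, its parameters, and the configurable `max_partition_size` (standing in for `INT_MAX` so that small inputs exercise several partitions) are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-side illustration; none of these names come from CUB.
// Returns {index, value} of the first minimum of a non-empty input.
template <typename T>
std::pair<std::int64_t, T>
streaming_arg_min(const std::vector<T>& in, std::int32_t max_partition_size)
{
  assert(!in.empty());
  assert(max_partition_size > 0);

  const auto num_items    = static_cast<std::int64_t>(in.size());
  std::int64_t best_index = 0;
  T best_value            = in[0];

  for (std::int64_t base = 0; base < num_items; base += max_partition_size)
  {
    // Work inside a partition only needs a narrow 32-bit offset.
    const auto partition_size =
      static_cast<std::int32_t>(std::min<std::int64_t>(max_partition_size, num_items - base));
    std::int32_t local_index = 0;
    for (std::int32_t i = 1; i < partition_size; ++i)
    {
      if (in[static_cast<std::size_t>(base + i)] < in[static_cast<std::size_t>(base + local_index)])
      {
        local_index = i;
      }
    }

    // The index is widened to 64 bits only when crossing the partition boundary.
    const T local_value = in[static_cast<std::size_t>(base + local_index)];
    if (local_value < best_value)
    {
      best_value = local_value;
      best_index = base + local_index;
    }
  }
  return {best_index, best_value};
}
```

With `max_partition_size = 3`, the input `{5, 3, 9, 3, 1, 7, 1}` is processed as three partitions and yields index 4 with value 1. Ties keep the earliest index because both comparisons are strict.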
Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}
* preliminary streaming arg-extremum reduction
* fixes implicit conversion
* uses streaming dispatch class
* changes arg benches to use new streaming reduce
* streaming arg-extrema reduction
* fixes style
* fixes compilation failures
* cleanups
* adds rst style comments
* declare vars const and use clamp
* consolidates argmin argmax benchmarks
* fixes thrust usage
* drops offset type in arg-extrema benchmarks
* fixes clang cuda
* exec space macros
* switch to signed global offset type for slightly better perf
* clarifies documentation
* applies minor benchmark style changes from review comments
* fixes interface documentation and comments
* list-init accumulating output op
* improves style, comments, and tests
* cleans up aggregate init
* renames dispatch class usage in benchmarks
* fixes merge conflicts
* addresses review comments
* addresses review comments
* fixes assertion
* removes superseded implementation
* changes large problem tests to use new interface
* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <[email protected]>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file
* adds support for large num items to device transform
* adds tests for large number of items to device interface
* fixes format
* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

  Two bugs, one of which occurs in practice:

  1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice)
  2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB.
     (Has not yet manifested itself)
* cp_async_bulk_tensor: Pre-emptively fence more in test
* cp_async_bulk: Fix test

  The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.
* [pre-commit.ci] auto code formatting
* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings
* suppress nvcc warning #186 in thrust header tests
* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fix thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators
* Define cache_with_key utility and use it to cache Reduce objects
* Add tests for caching Reduce objects
* Tighten up types
* Updates to support 3.7
* Address review feedback
* Introduce IteratorKind to hold iterator type information
* Use the .kind to generate an abi_name
* Remove __eq__ and __hash__ methods from IteratorBase
* Move helper function
* Formatting
* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting
* Update files after running pre-commit
* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <[email protected]>

fix
tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <[email protected]>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.
* Add missing files
* Check if a task implementation can match a prototype where the void_interface arguments are ignored
* Implement ctx.abstract_logical_data() which relies on a void data interface
* Illustrate how to use abstract handles in local contexts
* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages
* Small improvements in the examples
* Do not try to allocate or move void data
* Do not use I as a variable
* fix linkage error
* rename abstract_logical_data into logical_token
* Document logical token
* fix spelling error
* fix sphinx error
* reflect name changes
* use meaningful variable names
* simplify logical_token implementation because writeback is already disabled
* add a unit test for token elision
* implement token elision in host_launch
* Remove unused type
* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens
* Much simpler is_tuple_invocable_with_filtered implementation
* Fix buggy test
* Factorize code
* Document that we can ignore tokens for task and host_launch
* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)
* Implement a more robust way to check for contiguity
* Don't bother if cublas unavailable
* Fix how we check for zero-element arrays
* sort imports

---------

Co-authored-by: Ashwin Srinath <[email protected]>

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects (#3253)

* Add
escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects
* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <[email protected]>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place
* clang-format
* use _CCCL_ASSERT
* Attempt to debug
* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead
* fix lambda expression
* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed to _CCCL_HAS_PDL during the review introducing the feature.
Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template as in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <[email protected]>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

  updates:
  - [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
  - [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
  - [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6

  That is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add escape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC

  cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER.
As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK. * Workaround nvcc+MSVC issue * Regenerate devcontainers Fixes: #3249 Co-authored-by: Michael Schellenberger Costa <[email protected]> Update packman and repo_docs versions (#3293) Co-authored-by: Ashwin Srinath <[email protected]> Drop Thrust's deprecated compiler macros (#3301) Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305) Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506) * adds support for large number of items to three-way partition * adapts interface to use choose_signed_offset_t * integrates applicable feedback from device-select pr * changes behavior for empty problems * unifies grid constant macro * fixes kernel template specialization mismatch * integrates _CCCL_GRID_CONSTANT changes * resolve merge conflicts * fixes checks in test * fixes test verification * improves tests * makes few improvements to streaming dispatch * improves code comment on test * fixes unrelated compiler error * minor style improvements Refactor scan tunings (#3262) Require C++17 for compiling Thrust and CUB (#3255) * Issue an unsuppressable warning when compiling with < C++17 * Remove C++11/14 presets * Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers * Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14] * Remove CUB_ENABLE_DIALECT_CPP[11|14] * Update CI runs * Remove C++11/14 CI runs for CUB and Thrust * Raise compiler minimum versions for C++17 * Update ReadMe * Drop Thrust's cpp14_required.h * Add escape hatch for C++17 removal Fixes: #3252 Implement `views::empty` (#3254) * Disable pair conversion of subrange with clang in C++17 * Fix namespace views * Implement `views::empty` This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view Refactor `limits` and `climits` (#3221) * implement builtins for huge val, nan and nans * change `INFINITY` and `NAN` 
implementation for NVRTC cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311) * Add tests demonstrating usage of different iterators * Update documentation of reduce_into by merging import code snippet with the rest of the example * Add documentation for current iterators * Run pre-commit checks and update accordingly * Fix comments to refer to the proper lines in the code snippets in the docs Drop clang<14 from CI, update devcontainers. (#3309) Co-authored-by: Bernhard Manfred Gruber <[email protected]> [STF] Cleanup task dependencies object constructors (#3291) * Define tag types for access modes * - Rework how we build task_dep objects based on access mode tags - pack_state is now responsible for using a const_cast for read only data * Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums * It seems the const_cast was not necessarily so we can simplify it and not even do some dispatch based on access modes Disable test with a gcc-14 regression (#3297) Deprecate Thrust's cpp_compatibility.h macros (#3299) Remove dropped function objects from docs (#3319) Document `NV_TARGET` macros (#3313) [STF] Define ctx.pick_stream() which was missing for the unified context (#3326) * Define ctx.pick_stream() which was missing for the unified context * clang-format Deprecate cub::IterateThreadStore (#3337) Drop CUB's BinaryFlip operator (#3332) Deprecate cub::Swap (#3333) Clarify transform output can overlap input (#3323) Drop CUB APIs with a debug_synchronous parameter (#3330) Fixes: #3329 Drop CUB's util_compiler.cuh for real (#3340) PR #3302 planned to drop the file, but only dropped its content. This was an oversight. So let's drop the entire file. 
Drop cub::ValueCache (#3346) limits offset types for merge sort (#3328) Drop CDPv1 (#3344) Fixes: #3341 Drop thrust::void_t (#3362) Use cuda::std::addressof in Thrust (#3363) Fix all_of documentation for empty ranges (#3358) all_of always returns true on an empty range. [STF] Do not keep track of dangling events in a CUDA graph backend (#3327) * Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when the CUDA graph completes. Therefore keeping track of "dangling events" is a waste of time and resources. * replace can_ignore_dangling_events by track_dangling_events which leads to more readable code * When not storing the dangling events, we must still perform the deinit operations that were producing these events ! Extract scan kernels into NVRTC-compilable header (#3334) * Extract scan kernels into NVRTC-compilable header * Update cub/cub/device/dispatch/dispatch_scan.cuh Co-authored-by: Georgii Evtushenko <[email protected]> --------- Co-authored-by: Ashwin Srinath <[email protected]> Co-authored-by: Georgii Evtushenko <[email protected]> Drop deprecated aliases in Thrust functional (#3272) Fixes: #3271 Drop cub::DivideAndRoundUp (#3347) Use cuda::std::min/max in Thrust (#3364) Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361) * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` Cleanup util_arch (#2773) Deprecate thrust::null_type (#3367) Deprecate cub::DeviceSpmv (#3320) Fixes: #896 Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Compile basic infra test with C++17 (#3377) Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308) * fixes segment 
offset generation * switches to analytical verification * switches to analytical verification for pairs * addresses review comments * introduces segment offset type * adds tests for large number of segments * adds support for large number of segments * drops segment offset type * fixes thrust namespace * removes about-to-be-deprecated cub iterators * no exec specifier on defaulted ctor * fixes gcc7 linker error * uses local_segment_index_t throughout * determine offset type based on type returned by segment iterator begin/end iterators * minor style improvements Exit with error when RAPIDS CI fails. (#3385) cuda.parallel: Support structured types as algorithm inputs (#3218) * Introduce gpu_struct decorator and typing * Enable `reduce` to accept arrays of structs as inputs * Add test for reducing arrays-of-struct * Update documentation * Use a numpy array rather than ctypes object * Change zeros -> empty for output array and temp storage * Add a TODO for typing GpuStruct * Documentation udpates * Remove test_reduce_struct_type from test_reduce.py * Revert to `to_cccl_value()` accepting ndarray + GpuStruct * Bump copyrights --------- Co-authored-by: Ashwin Srinath <[email protected]> Deprecate thrust::async (#3324) Fixes: #100 Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342) Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314) * add compiler-specific path * fix device code path * add _CCC_ASSUME Deprecate thrust::numeric_limits (#3366) Replace `typedef` with `using` in libcu++ (#3368) Deprecate thrust::optional (#3307) Fixes: #3306 Upgrade to Catch2 3.8 (#3310) Fixes: #1724 refactor `<cuda/std/cstdint>` (#3325) Co-authored-by: Bernhard Manfred Gruber <[email protected]> Update CODEOWNERS (#3331) * Update CODEOWNERS * Update CODEOWNERS * Update CODEOWNERS * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix sign-compare warning (#3408) Implement more cmath functions to be usable 
on host and device (#3382) * Implement more cmath functions to be usable on host and device * Implement math roots functions * Implement exponential functions Redefine and deprecate thrust::remove_cvref (#3394) * Redefine and deprecate thrust::remove_cvref Co-authored-by: Michael Schellenberger Costa <[email protected]> Fix assert definition for NVHPC due to constexpr issues (#3418) NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it. Fix this by always using the host definition which should also work on device. Fixes #3411 Extend CUB reduce benchmarks (#3401) * Rename max.cu to custom.cu, since it uses a custom operator * Extend types covered my min.cu to all fundamental types * Add some notes on how to collect tuning parameters Fixes: #3283 Update upload-pages-artifact to v3 (#3423) * Update upload-pages-artifact to v3 * Empty commit --------- Co-authored-by: Ashwin Srinath <[email protected]> Replace and deprecate thrust::cuda_cub::terminate (#3421) `std::linalg` accessors and `transposed_layout` (#2962) Add round up/down to multiple (#3234) [FEA]: Introduce Python module with CCCL headers (#3201) * Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative * Run `copy_cccl_headers_to_aude_include()` before `setup()` * Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path. * Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel * Bug fix: cuda/_include only exists after shutil.copytree() ran. * Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py * Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions) * Replace := operator (needs Python 3.8+) * Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md * Restore original README.md: `pip3 install -e` now works on first pass. 
* cuda_cccl/README.md: FOR INTERNAL USE ONLY * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917) Command used: ci/update_version.sh 2 8 0 * Modernize pyproject.toml, setup.py Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178 * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996 * Install CCCL headers under cuda.cccl.include Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562 Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely. * Factor out cuda_cccl/cuda/cccl/include_paths.py * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative * Add missing Copyright notice. * Add missing __init__.py (cuda.cccl) * Add `"cuda.cccl"` to `autodoc.mock_imports` * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.) * Add # TODO: move this to a module-level import * Modernize cuda_cooperative/pyproject.toml, setup.py * Convert cuda_cooperative to use hatchling as build backend. * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe. * Move numpy from [build-system] requires -> [project] dependencies * Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH * Remove copy_license() and use license_files=["../../LICENSE"] instead. 
* Further modernize cuda_cccl/setup.py to use pathlib * Trivial simplifications in cuda_cccl/pyproject.toml * Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code * Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml * Add taplo-pre-commit to .pre-commit-config.yaml * taplo-pre-commit auto-fixes * Use pathlib in cuda_cooperative/setup.py * CCCL_PYTHON_PATH in cuda_cooperative/setup.py * Modernize cuda_parallel/pyproject.toml, setup.py * Use pathlib in cuda_parallel/setup.py * Add `# TOML lint & format` comment. * Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml * Use pathlib in cuda/cccl/include_paths.py * pre-commit autoupdate (EXCEPT clang-format, which was manually restored) * Fixes after git merge main * Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result' ``` =========================================================================== warnings summary =========================================================================== tests/test_reduce.py::test_reduce_non_contiguous /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080> Traceback (most recent call last): File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__ bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result)) ^^^^^^^^^^^^^^^^^ AttributeError: '_Reduce' object has no attribute 'build_result' warnings.warn(pytest.PytestUnraisableExceptionWarning(msg)) -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ============================================================== ``` * Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class 
CustomBuildPy` * Introduce cuda_cooperative/constraints.txt * Also add cuda_parallel/constraints.txt * Add `--constraint constraints.txt` in ci/test_python.sh * Update Copyright dates * Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024) For completeness: The other repo took a long time to install into the pre-commit cache; so long it lead to timeouts in the CCCL CI. * Remove unused cuda_parallel jinja2 dependency (noticed by chance). * Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead. * Make cuda_cooperative, cuda_parallel testing completely independent. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]" This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2. Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971 * Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Restore original ci/matrix.yaml [skip-rapids] * Use for loop in test_python.sh to avoid code duplication. 
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci] * Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]" This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d. * Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460) * Address feedback by @leofang --------- Co-authored-by: Bernhard Manfred Gruber <[email protected]> cuda.parallel: Add optional stream argument to reduce_into() (#3348) * Add optional stream argument to reduce_into() * Add tests to check for reduce_into() stream behavior * Move protocol related utils to separate file and rework __cuda_stream__ error messages * Fix synchronization issue in stream test and add one more invalid stream test case * Rename cuda stream validation function after removing leading underscore * Unpack values from __cuda_stream__ instead of indexing * Fix linting errors * Handle TypeError when unpacking invalid __cuda_stream__ return * Use stream to allocate cupy memory in new stream test Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434) Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419) * Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ Fixes #3404 Fix CI issues (#3443) update docs fix review restrict allowed types replace constexpr implementations with generic optimize `__is_arithmetic_integral`

elstehle added 7 commits October 28, 2024 23:41

adds benchmarks for reduce::arg{min,max}

31e555d

preliminary streaming arg-extremum reduction

30b88a1

fixes implicit conversion

3c5f322

uses streaming dispatch class

cc95d62

changes arg benches to use new streaming reduce

dad724e

streaming arg-extrema reduction

3f4cbf4

fixes style

1895d38

elstehle requested review from a team as code owners October 29, 2024 13:09

elstehle requested review from gevtushenko and fbusato October 29, 2024 13:09

elstehle commented Oct 29, 2024

cub/cub/device/device_reduce.cuh (outdated, resolved)

bernhardmgruber reviewed Oct 29, 2024

fixes compilation failures

ee341e1

elstehle marked this pull request as draft October 30, 2024 07:58

cleanups

1c8a62c

elstehle marked this pull request as ready for review October 30, 2024 13:03

elstehle added 2 commits October 30, 2024 07:06

adds rst style comments

f8cca48

declare vars const and use clamp

757672b

bernhardmgruber self-assigned this Oct 30, 2024

elstehle added 2 commits October 30, 2024 10:06

consolidates argmin argmax benchmarks

14bb6ad

fixes thrust usage

92730b6

elstehle added 5 commits October 30, 2024 21:19

drops offset type in arg-extrema benchmarks

d1cac78

fixes clang cuda

6600bcf

exec space macros

1d6e6b3

switch to signed global offset type for slightly better perf

39bffee

clarifies documentation

4ffe18b

elstehle added 2 commits November 1, 2024 03:39

cleans up aggregate init

0d07c25

renames dispatch class usage in benchmarks

6d559fb

elstehle added 2 commits December 3, 2024 00:58

Merge remote-tracking branch 'upstream/main' into enh/large-num-items-reduce-argminmax

326d3b7

fixes merge conflicts

cad6b8a

bernhardmgruber requested changes Dec 4, 2024

cub/benchmarks/bench/reduce/arg_extrema.cu (outdated, resolved)

cub/cub/device/dispatch/dispatch_streaming_reduce.cuh (outdated, resolved)

cub/cub/device/dispatch/dispatch_streaming_reduce.cuh (outdated, resolved)

addresses review comments

e57e453

bernhardmgruber reviewed Dec 5, 2024

elstehle added 2 commits December 6, 2024 00:00

addresses review comments

ba7e12b

fixes assertion

952ebc2

bernhardmgruber approved these changes Dec 6, 2024

elstehle added the blocked This PR cannot be merged due to various reasons label Dec 12, 2024

elstehle changed the title ~~Adds support for large num_items to DeviceReduce::{ArgMin,ArgMax}~~ [DO NOT MERGE] Adds support for large num_items to DeviceReduce::{ArgMin,ArgMax} Dec 12, 2024

This was referenced Dec 16, 2024

[EPIC]: CUB large input support #50

Open

Introduces new DeviceReduce::Arg{Min,Max} interface with two output iterators #3148

Merged

elstehle added 3 commits December 19, 2024 03:26

Merge branch 'main' into enh/large-num-items-reduce-argminmax

7544fc2

removes superseded implementation

b1afa36

changes large problem tests to use new interface

33393d0

elstehle removed the blocked This PR cannot be merged due to various reasons label Dec 19, 2024

elstehle changed the title ~~[DO NOT MERGE] Adds support for large num_items to DeviceReduce::{ArgMin,ArgMax}~~ Adds support for large num_items to DeviceReduce::{ArgMin,ArgMax} Dec 19, 2024

removes obsolete tests for deprecated interface

1effe5f

elstehle merged commit 3925c37 into NVIDIA:main Dec 20, 2024
110 checks passed

	*out_it = kv_pair_t{static_cast<index_t>(reduced_result.key), reduced_result.value};
	_CCCL_ASSERT(static_cast<OffsetT>(static_cast<index_t>(reduced_result.key)) == reduced_result.key);
	*out_it = kv_pair_t{static_cast<index_t>(reduced_result.key), reduced_result.value};
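
For context, the assertion in the suggestion above checks that narrowing the global offset (`OffsetT`) to the output index type (`index_t`) loses no information: it casts the value to the narrow type, casts it back, and compares against the original. Below is a minimal host-only sketch of that round-trip check. `survives_round_trip` and `checked_narrow` are hypothetical helper names used only for illustration (they are not CUB APIs), and plain `assert` stands in for `_CCCL_ASSERT`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of CUB): returns true if `wide` keeps its
// value after a round trip through the narrower type `Narrow`.
template <typename Narrow, typename Wide>
constexpr bool survives_round_trip(Wide wide)
{
  return static_cast<Wide>(static_cast<Narrow>(wide)) == wide;
}

// Hypothetical helper (not part of CUB): narrows a wide offset to the output
// index type and asserts, in debug builds, that the value was representable.
template <typename Narrow, typename Wide>
Narrow checked_narrow(Wide wide)
{
  assert(survives_round_trip<Narrow>(wide) && "offset does not fit in the index type");
  return static_cast<Narrow>(wide);
}
```

With these helpers, `checked_narrow<std::int32_t>(std::int64_t{42})` yields `42`, whereas an offset of 2^31 does not survive the round trip through `std::int32_t` and would trip the assertion in a debug build.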

elstehle commented Oct 30, 2024

github-actions bot commented Oct 30, 2024

🟨 cub: Pass: 96%/110 | Total: 3d 20h | Avg: 50m 16s | Max: 1h 13m | Hits: 64%/2948

🟩 thrust: Pass: 100%/109 | Total: 2d 06h | Avg: 29m 43s | Max: 58m 06s | Hits: 83%/13165

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 25s | Avg: 4m 42s | Max: 7m 08s

🟩 pycuda: Pass: 100%/1 | Total: 15m 26s | Avg: 15m 26s | Max: 15m 26s

🏃‍ Runner counts (total jobs: 222)

github-actions bot commented Oct 31, 2024

🟩 cub: Pass: 100%/110 | Total: 3d 18h | Avg: 49m 10s | Max: 1h 07m | Hits: 64%/2948

🟩 thrust: Pass: 100%/109 | Total: 2d 05h | Avg: 29m 18s | Max: 57m 55s | Hits: 83%/13165

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 32s | Avg: 4m 46s | Max: 7m 09s

🟩 pycuda: Pass: 100%/1 | Total: 15m 49s | Avg: 15m 49s | Max: 15m 49s

🏃‍ Runner counts (total jobs: 222)

github-actions bot commented Nov 1, 2024

🟩 cub: Pass: 100%/110 | Total: 2d 18h | Avg: 36m 06s | Max: 1h 11m | Hits: 89%/2944

🟩 thrust: Pass: 100%/109 | Total: 17h 20m | Avg: 9m 32s | Max: 41m 32s | Hits: 92%/13165

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 48s | Avg: 4m 24s | Max: 6m 38s

🟩 python: Pass: 100%/1 | Total: 13m 41s | Avg: 13m 41s | Max: 13m 41s

🏃‍ Runner counts (total jobs: 222)

github-actions bot commented Dec 3, 2024

🟨 cub: Pass: 99%/110 | Total: 3d 22h | Avg: 51m 25s | Max: 1h 17m | Hits: 98%/3048

🟩 thrust: Pass: 100%/111 | Total: 13h 15m | Avg: 7m 09s | Max: 27m 42s | Hits: 99%/9260

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 14s | Avg: 5m 07s | Max: 7m 44s

🟩 python: Pass: 100%/1 | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s

🏃‍ Runner counts (total jobs: 224)

github-actions bot commented Dec 4, 2024

🟨 cub: Pass: 99%/110 | Total: 3d 22h | Avg: 51m 19s | Max: 1h 17m | Hits: 98%/3048

🟩 thrust: Pass: 100%/111 | Total: 13h 15m | Avg: 7m 09s | Max: 27m 42s | Hits: 99%/9260

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 14s | Avg: 5m 07s | Max: 7m 44s

🟩 python: Pass: 100%/1 | Total: 14m 14s | Avg: 14m 14s | Max: 14m 14s

🏃‍ Runner counts (total jobs: 224)

github-actions bot commented Dec 4, 2024

🟨 cub: Pass: 99%/110 | Total: 2d 12h | Avg: 33m 01s | Max: 55m 23s | Hits: 99%/3048

🟩 thrust: Pass: 100%/111 | Total: 13h 26m | Avg: 7m 16s | Max: 29m 32s | Hits: 99%/9260

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 36s | Avg: 5m 18s | Max: 8m 11s

🟩 python: Pass: 100%/1 | Total: 15m 25s | Avg: 15m 25s | Max: 15m 25s

🏃‍ Runner counts (total jobs: 224)

bernhardmgruber left a comment

bernhardmgruber Dec 5, 2024

github-actions bot commented Dec 19, 2024

🟩 cub: Pass: 100%/47 | Total: 1d 13h | Avg: 47m 55s | Max: 1h 04m | Hits: 63%/3144

🟩 thrust: Pass: 100%/46 | Total: 23h 21m | Avg: 30m 27s | Max: 56m 44s | Hits: 77%/9260

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 07s | Avg: 4m 33s | Max: 7m 01s

🟩 python: Pass: 100%/1 | Total: 25m 29s | Avg: 25m 29s | Max: 25m 29s

🏃‍ Runner counts (total jobs: 96)

github-actions bot commented Dec 19, 2024

🟩 cub: Pass: 100%/47 | Total: 6h 48m | Avg: 8m 41s | Max: 26m 17s | Hits: 98%/3144

🟩 thrust: Pass: 100%/46 | Total: 5h 57m | Avg: 7m 46s | Max: 21m 35s | Hits: 99%/9260

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 56s | Avg: 4m 28s | Max: 6m 59s

elstehle commented Oct 29, 2024 (edited)

Performance comparison for `old.main.i64 offset type` vs. streaming approach:

Performance comparison for `old.main.i32` vs. streaming approach: