Change streaming algorithms to use operator+= from using operator+ #4428

Open: wants to merge 5 commits into base: main

Conversation

oleksandr-pavlyk
Contributor

@oleksandr-pavlyk oleksandr-pavlyk commented Apr 13, 2025

  1. Changed the streaming algorithm that supports num_segments greater than INT_MAX in segmented_reduce so that it increments iterators on the host with operator+= instead of operator+, saving constructor/copy calls.

  2. Introduced void advance_iterators_inplace_if_supported(IteratorT &iter, OffsetT diff), which uses operator+=, alongside the existing IteratorT advance_iterators_if_supported(IteratorT iter, OffsetT diff).

  3. Since these functions are called from dispatcher functions annotated as CUB_RUNTIME_FUNCTION, changed their annotations from _CCCL_HOST_DEVICE to CUB_RUNTIME_FUNCTION.

    This change broke the test_nvrtc test, because the dispatch_common.cuh header where these functions are defined is used by agent_select_if.cuh, which must be compiled by NVRTC.

  4. Hence I moved the functions for advancing iterators from dispatch_common.cuh into a new header file, dispatch_advance_iterators.cuh, which is included from dispatch_reduce.cuh and dispatch_radix_sort.cuh.
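
For readers following the description, the two helpers from item 2 can be sketched roughly as below. This is a host-only illustration under stated assumptions, not the code from the PR: the `sketch` namespace, the detection trait `has_add_assign`, and the `opaque_arg` type are hypothetical names, and the CUB annotations (`CUB_RUNTIME_FUNCTION` and friends) are left out so that a plain C++17 compiler accepts it.

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

namespace sketch
{
// Hypothetical detection trait: true when `iter += offset` is well-formed.
template <typename IteratorT, typename OffsetT, typename = void>
struct has_add_assign : std::false_type
{};

template <typename IteratorT, typename OffsetT>
struct has_add_assign<IteratorT,
                      OffsetT,
                      std::void_t<decltype(std::declval<IteratorT&>() += std::declval<OffsetT>())>>
    : std::true_type
{};

// In-place variant: mutates the caller's iterator, so no copy is constructed.
// For types without operator+= the call compiles to a no-op.
template <typename IteratorT, typename OffsetT>
void advance_iterators_inplace_if_supported(IteratorT& iter, [[maybe_unused]] OffsetT offset)
{
  if constexpr (has_add_assign<IteratorT, OffsetT>::value)
  {
    iter += offset;
  }
}

// By-value variant: returns an advanced copy. The PR description says the
// existing helper relies on operator+; this sketch reuses the in-place helper
// on the by-value parameter purely for brevity.
template <typename IteratorT, typename OffsetT>
IteratorT advance_iterators_if_supported(IteratorT iter, OffsetT offset)
{
  advance_iterators_inplace_if_supported(iter, offset);
  return iter;
}

// Hypothetical type with no arithmetic operators, standing in for a
// type-erased argument.
struct opaque_arg
{
  void* ptr;
};
} // namespace sketch
```

With a raw pointer both helpers advance as expected, while an `opaque_arg` passes through unchanged, which is why the callers need a separate guard for that case.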

@oleksandr-pavlyk oleksandr-pavlyk requested a review from a team as a code owner April 13, 2025 00:40
@github-project-automation github-project-automation bot moved this to Todo in CCCL Apr 13, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Apr 13, 2025
Contributor

🟨 CI finished in 3h 44m: Pass: 95%/103 | Total: 2d 22h | Avg: 41m 02s | Max: 1h 28m | Hits: 75%/134359
  • 🟨 cub: Pass: 89%/47 | Total: 1d 21h | Avg: 58m 05s | Max: 1h 28m | Hits: 64%/50568

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  88%/45  | Total:  1d 19h | Avg: 57m 52s | Max:  1h 28m | Hits:  64%/48102 
      🟩 arm64              Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 03m | Hits:  61%/2466  
    🔍 ctk: 12.8 🔍
      🟩 12.0               Pass: 100%/5   | Total:  5h 36m | Avg:  1h 07m | Max:  1h 11m | Hits:  61%/5994  
      🔍 12.8               Pass:  88%/42  | Total:  1d 15h | Avg: 57m 00s | Max:  1h 28m | Hits:  64%/44574 
    🔍 cudacxx: nvcc12.8 🔍
      🟩 ClangCUDA19        Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 01m | Hits:  66%/2128  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  5h 36m | Avg:  1h 07m | Max:  1h 11m | Hits:  61%/5994  
      🔍 nvcc12.8           Pass:  87%/40  | Total:  1d 13h | Avg: 56m 48s | Max:  1h 28m | Hits:  64%/42446 
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 01m | Hits:  66%/2128  
      🔍 nvcc               Pass:  88%/45  | Total:  1d 19h | Avg: 57m 57s | Max:  1h 28m | Hits:  64%/48440 
    🔍 sm: 90 🔍
      🔍 90                 Pass:  66%/3   | Total:  1h 19m | Avg: 26m 21s | Max: 29m 06s | Hits:  80%/2466  
      🟩 90;90a;100         Pass: 100%/1   | Total:  1h 10m | Avg:  1h 10m | Max:  1h 10m | Hits:  60%/1233  
    🔍 std: 20 🔍
      🟩 17                 Pass: 100%/21  | Total: 23h 10m | Avg:  1h 06m | Max:  1h 25m | Hits:  61%/25110 
      🔍 20                 Pass:  80%/26  | Total: 22h 20m | Avg: 51m 32s | Max:  1h 28m | Hits:  67%/25458 
    🟨 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  4h 14m | Avg:  1h 03m | Max:  1h 07m | Hits:  61%/4940  
      🟩 Clang15            Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 06m | Hits:  61%/2466  
      🟩 Clang16            Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 04m | Hits:  61%/2466  
      🟩 Clang17            Pass: 100%/2   | Total:  2h 02m | Avg:  1h 01m | Max:  1h 01m | Hits:  61%/2466  
      🟩 Clang18            Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 00m | Hits:  61%/2466  
      🟨 Clang19            Pass:  85%/7   | Total:  5h 54m | Avg: 50m 39s | Max:  1h 02m | Hits:  69%/7060  
      🟩 GCC7               Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 06m | Hits:  60%/2470  
      🟩 GCC8               Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m | Hits:  60%/1235  
      🟩 GCC9               Pass: 100%/2   | Total:  2h 11m | Avg:  1h 05m | Max:  1h 07m | Hits:  60%/2470  
      🟩 GCC10              Pass: 100%/2   | Total:  2h 11m | Avg:  1h 05m | Max:  1h 08m | Hits:  60%/2470  
      🟩 GCC11              Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 06m | Hits:  60%/2466  
      🟩 GCC12              Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 03m | Hits:  60%/2466  
      🟨 GCC13              Pass:  63%/11  | Total:  7h 16m | Avg: 39m 38s | Max:  1h 10m | Hits:  71%/8631  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 37m | Avg:  1h 18m | Max:  1h 25m | Hits:  65%/2108  
      🟩 MSVC14.42          Pass: 100%/2   | Total:  2h 51m | Avg:  1h 25m | Max:  1h 28m | Hits:  65%/2108  
      🟩 NVHPC25.3          Pass: 100%/2   | Total:  2h 30m | Avg:  1h 15m | Max:  1h 16m | Hits:  60%/2280  
    🟨 cxx_family
      🟨 Clang              Pass:  94%/19  | Total: 18h 24m | Avg: 58m 06s | Max:  1h 07m | Hits:  63%/21864 
      🟨 GCC                Pass:  81%/22  | Total: 19h 07m | Avg: 52m 10s | Max:  1h 10m | Hits:  65%/22208 
      🟩 MSVC               Pass: 100%/4   | Total:  5h 28m | Avg:  1h 22m | Max:  1h 28m | Hits:  65%/4216  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 30m | Avg:  1h 15m | Max:  1h 16m | Hits:  60%/2280  
    🟨 gpu
      🟨 h100               Pass:  66%/3   | Total:  1h 19m | Avg: 26m 21s | Max: 29m 06s | Hits:  80%/2466  
      🟩 rtx2080            Pass: 100%/36  | Total:  1d 15h | Avg:  1h 06m | Max:  1h 28m | Hits:  61%/43170 
      🟨 rtxa6000           Pass:  50%/8   | Total:  4h 30m | Avg: 33m 50s | Max:  1h 02m | Hits:  80%/4932  
    🟨 jobs
      🟩 Build              Pass: 100%/39  | Total:  1d 18h | Avg:  1h 04m | Max:  1h 28m | Hits:  61%/46869 
      🟥 DeviceLaunch       Pass:   0%/1   | Total: 27m 29s | Avg: 27m 29s | Max: 27m 29s
      🟥 GraphCapture       Pass:   0%/1   | Total: 19m 57s | Avg: 19m 57s | Max: 19m 57s
      🟥 HostLaunch         Pass:   0%/3   | Total:  1h 22m | Avg: 27m 39s | Max: 29m 06s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 06m | Avg: 22m 10s | Max: 23m 55s | Hits:  99%/3699  
    
  • 🟩 thrust: Pass: 100%/47 | Total: 23h 38m | Avg: 30m 11s | Max: 1h 05m | Hits: 81%/83463

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 37m 31s | Avg: 18m 45s | Max: 26m 01s | Hits:  89%/3554  
    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total: 22h 43m | Avg: 30m 17s | Max:  1h 05m | Hits:  81%/79910 
      🟩 arm64              Pass: 100%/2   | Total: 55m 39s | Avg: 27m 49s | Max: 30m 05s | Hits:  79%/3553  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  2h 50m | Avg: 34m 08s | Max: 52m 57s | Hits:  78%/8876  
      🟩 12.8               Pass: 100%/42  | Total: 20h 48m | Avg: 29m 43s | Max:  1h 05m | Hits:  81%/74587 
    🟩 cudacxx
      🟩 ClangCUDA19        Pass: 100%/2   | Total: 48m 32s | Avg: 24m 16s | Max: 24m 49s | Hits:  79%/3552  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  2h 50m | Avg: 34m 08s | Max: 52m 57s | Hits:  78%/8876  
      🟩 nvcc12.8           Pass: 100%/40  | Total: 19h 59m | Avg: 29m 59s | Max:  1h 05m | Hits:  82%/71035 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 48m 32s | Avg: 24m 16s | Max: 24m 49s | Hits:  79%/3552  
      🟩 nvcc               Pass: 100%/45  | Total: 22h 50m | Avg: 30m 27s | Max:  1h 05m | Hits:  81%/79911 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  1h 54m | Avg: 28m 43s | Max: 29m 45s | Hits:  79%/7104  
      🟩 Clang15            Pass: 100%/2   | Total: 59m 43s | Avg: 29m 51s | Max: 31m 50s | Hits:  79%/3552  
      🟩 Clang16            Pass: 100%/2   | Total: 57m 09s | Avg: 28m 34s | Max: 29m 10s | Hits:  79%/3552  
      🟩 Clang17            Pass: 100%/2   | Total: 57m 59s | Avg: 28m 59s | Max: 29m 55s | Hits:  79%/3552  
      🟩 Clang18            Pass: 100%/2   | Total: 58m 56s | Avg: 29m 28s | Max: 30m 33s | Hits:  79%/3552  
      🟩 Clang19            Pass: 100%/7   | Total:  2h 27m | Avg: 21m 00s | Max: 28m 07s | Hits:  85%/12432 
      🟩 GCC7               Pass: 100%/2   | Total: 59m 54s | Avg: 29m 57s | Max: 30m 07s | Hits:  79%/3554  
      🟩 GCC8               Pass: 100%/1   | Total: 30m 44s | Avg: 30m 44s | Max: 30m 44s | Hits:  79%/1777  
      🟩 GCC9               Pass: 100%/2   | Total:  1h 01m | Avg: 30m 33s | Max: 31m 15s | Hits:  79%/3554  
      🟩 GCC10              Pass: 100%/2   | Total:  1h 01m | Avg: 30m 47s | Max: 32m 13s | Hits:  79%/3554  
      🟩 GCC11              Pass: 100%/2   | Total:  1h 02m | Avg: 31m 03s | Max: 33m 49s | Hits:  79%/3554  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 03m | Avg: 31m 56s | Max: 34m 01s | Hits:  79%/3554  
      🟩 GCC13              Pass: 100%/10  | Total:  3h 29m | Avg: 20m 59s | Max: 31m 21s | Hits:  87%/17770 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 45m | Avg: 52m 55s | Max: 52m 57s | Hits:  73%/3540  
      🟩 MSVC14.42          Pass: 100%/3   | Total:  2h 24m | Avg: 48m 06s | Max:  1h 00m | Hits:  81%/5310  
      🟩 NVHPC25.3          Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 05m | Hits:  73%/3552  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  8h 15m | Avg: 26m 05s | Max: 31m 50s | Hits:  81%/33744 
      🟩 GCC                Pass: 100%/21  | Total:  9h 09m | Avg: 26m 09s | Max: 34m 01s | Hits:  83%/37317 
      🟩 MSVC               Pass: 100%/5   | Total:  4h 10m | Avg: 50m 01s | Max:  1h 00m | Hits:  78%/8850  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 05m | Hits:  73%/3552  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 29m 00s | Avg: 14m 30s | Max: 16m 20s | Hits:  89%/3554  
      🟩 rtx2080            Pass: 100%/35  | Total: 19h 32m | Avg: 33m 29s | Max:  1h 05m | Hits:  78%/62156 
      🟩 rtx4090            Pass: 100%/10  | Total:  3h 37m | Avg: 21m 45s | Max: 54m 50s | Hits:  90%/17753 
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total: 22h 07m | Avg: 33m 11s | Max:  1h 05m | Hits:  78%/71033 
      🟩 TestCPU            Pass: 100%/3   | Total: 44m 58s | Avg: 14m 59s | Max: 29m 03s | Hits:  99%/5323  
      🟩 TestGPU            Pass: 100%/4   | Total: 46m 29s | Avg: 11m 37s | Max: 12m 40s | Hits:  99%/7107  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 29m 00s | Avg: 14m 30s | Max: 16m 20s | Hits:  89%/3554  
      🟩 90;90a;100         Pass: 100%/1   | Total: 30m 52s | Avg: 30m 52s | Max: 30m 52s | Hits:  79%/1777  
    🟩 std
      🟩 17                 Pass: 100%/21  | Total: 12h 24m | Avg: 35m 28s | Max:  1h 05m | Hits:  78%/37287 
      🟩 20                 Pass: 100%/24  | Total: 10h 36m | Avg: 26m 31s | Max: 58m 27s | Hits:  83%/42622 
    
  • 🟩 stdpar: Pass: 100%/4 | Total: 18m 17s | Avg: 4m 34s | Max: 5m 18s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 26s | Avg:  5m 13s | Max:  5m 18s
      🟩 arm64              Pass: 100%/2   | Total:  7m 51s | Avg:  3m 55s | Max:  4m 00s
    🟩 ctk
      🟩 12.8               Pass: 100%/4   | Total: 18m 17s | Avg:  4m 34s | Max:  5m 18s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/4   | Total: 18m 17s | Avg:  4m 34s | Max:  5m 18s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 18m 17s | Avg:  4m 34s | Max:  5m 18s
    🟩 cxx
      🟩 NVHPC25.3          Pass: 100%/4   | Total: 18m 17s | Avg:  4m 34s | Max:  5m 18s
    🟩 cxx_family
      🟩 NVHPC              Pass: 100%/4   | Total: 18m 17s | Avg:  4m 34s | Max:  5m 18s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 18m 17s | Avg:  4m 34s | Max:  5m 18s
    🟩 jobs
      🟩 Build              Pass: 100%/4   | Total: 18m 17s | Avg:  4m 34s | Max:  5m 18s
    🟩 std
      🟩 17                 Pass: 100%/2   | Total:  8m 59s | Avg:  4m 29s | Max:  5m 08s
      🟩 20                 Pass: 100%/2   | Total:  9m 18s | Avg:  4m 39s | Max:  5m 18s
    
  • 🟩 python: Pass: 100%/3 | Total: 35m 27s | Avg: 11m 49s | Max: 18m 58s

    🟩 cpu
      🟩 amd64              Pass: 100%/3   | Total: 35m 27s | Avg: 11m 49s | Max: 18m 58s
    🟩 ctk
      🟩 12.8               Pass: 100%/3   | Total: 35m 27s | Avg: 11m 49s | Max: 18m 58s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/3   | Total: 35m 27s | Avg: 11m 49s | Max: 18m 58s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/3   | Total: 35m 27s | Avg: 11m 49s | Max: 18m 58s
    🟩 cxx
      🟩 GCC13              Pass: 100%/3   | Total: 35m 27s | Avg: 11m 49s | Max: 18m 58s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/3   | Total: 35m 27s | Avg: 11m 49s | Max: 18m 58s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/3   | Total: 35m 27s | Avg: 11m 49s | Max: 18m 58s
    🟩 jobs
      🟩 cuda.cccl          Pass: 100%/1   | Total:  3m 50s | Avg:  3m 50s | Max:  3m 50s
      🟩 cuda.cooperative   Pass: 100%/1   | Total: 18m 58s | Avg: 18m 58s | Max: 18m 58s
      🟩 cuda.parallel      Pass: 100%/1   | Total: 12m 39s | Avg: 12m 39s | Max: 12m 39s
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 24m 34s | Avg: 12m 17s | Max: 22m 10s | Hits: 97%/328

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 24m 34s | Avg: 12m 17s | Max: 22m 10s | Hits:  97%/328   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total: 24m 34s | Avg: 12m 17s | Max: 22m 10s | Hits:  97%/328   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total: 24m 34s | Avg: 12m 17s | Max: 22m 10s | Hits:  97%/328   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 24m 34s | Avg: 12m 17s | Max: 22m 10s | Hits:  97%/328   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 24m 34s | Avg: 12m 17s | Max: 22m 10s | Hits:  97%/328   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 24m 34s | Avg: 12m 17s | Max: 22m 10s | Hits:  97%/328   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 24m 34s | Avg: 12m 17s | Max: 22m 10s | Hits:  97%/328   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 24s | Avg:  2m 24s | Max:  2m 24s | Hits:  96%/164   
      🟩 Test               Pass: 100%/1   | Total: 22m 10s | Avg: 22m 10s | Max: 22m 10s | Hits:  98%/164   
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
stdpar
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- stdpar
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 103)

# Runner
72 linux-amd64-cpu16
9 windows-amd64-cpu16
6 linux-arm64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-amd64-gpu-rtx2080-latest-1
3 linux-amd64-gpu-h100-latest-1
3 linux-amd64-gpu-rtx4090-latest-1

Contributor

@miscco miscco left a comment

Those functions are really dangerous, and I would strongly prefer to just use cuda::std::advance instead.

// Helper function that advances a given iterator only if it supports being advanced by the given offset
template <typename IteratorT, typename OffsetT>
CUB_RUNTIME_FUNCTION _CCCL_VISIBILITY_HIDDEN _CCCL_FORCEINLINE IteratorT
advance_iterators_if_supported(IteratorT iter, [[maybe_unused]] OffsetT offset)
Contributor

Why don't we just use cuda::std::advance?

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Apr 14, 2025
@oleksandr-pavlyk
Contributor Author

oleksandr-pavlyk commented Apr 14, 2025

For the sake of my own understanding: does the danger come from the current code disregarding iterator tags?

@miscco
Contributor

miscco commented Apr 14, 2025

For the sake of my own understanding: does the danger come from the current code disregarding iterator tags?

The danger comes from the operation conditionally doing things. We should ensure that we always properly combine the "fast" path with the fallback.

As written, I see this as quite dangerous, and I also do not understand why we cannot just use cuda::std::advance.

Comment on lines 1311 to 1312
detail::advance_iterators_inplace_if_supported(d_begin_offsets, num_current_segments);
detail::advance_iterators_inplace_if_supported(d_end_offsets, num_current_segments);
Contributor

This should just be

Suggested change
- detail::advance_iterators_inplace_if_supported(d_begin_offsets, num_current_segments);
- detail::advance_iterators_inplace_if_supported(d_end_offsets, num_current_segments);
+ ::cuda::std::advance(d_begin_offsets, num_current_segments);
+ ::cuda::std::advance(d_end_offsets, num_current_segments);

Contributor Author

BTW, making this change would break the c.parallel build. We would still need

if constexpr (has_add_assign_operator) {
   ::cuda::std::advance(d_begin_offsets, num_current_segments);
}

@oleksandr-pavlyk
Contributor Author

@miscco The conditionals around the increments are there to support compiling with indirect_arg_t, which implements neither operator+ nor operator+=. The conditionals let the compiler simply omit the increments for such types, which on its own would result in code with incorrect logic.

For that reason, a run-time check is inserted before the increment iteration loop to ensure that code execution never reaches the loop in that case.

::cuda::std::advance could be used instead of it += n in advance_iterators_inplace_if_supported. It could also be used in advance_iterators_if_supported by copying the argument into a temporary and modifying it in place.
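
The interplay of the compile-time conditional and the run-time check can be sketched as follows. This is an illustration under assumptions rather than the PR's code: `std::advance` stands in for `::cuda::std::advance` so the snippet builds without CUDA, `can_advance` and `process_in_chunks` are hypothetical names, the `launch` callback stands in for a kernel launch, and the exact condition of the real run-time check is not shown in this thread.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <type_traits>
#include <utility>

namespace streaming_sketch
{
// Hypothetical trait: true when the iterator supports `it += n`.
template <typename It, typename = void>
struct can_advance : std::false_type
{};

template <typename It>
struct can_advance<It, std::void_t<decltype(std::declval<It&>() += std::declval<std::int64_t>())>>
    : std::true_type
{};

// Processes `num_segments` segments in chunks of at most `max_chunk`.
// Returns false without launching anything when more than one chunk is
// needed but the offset iterator cannot be advanced.
template <typename OffsetIt, typename LaunchFn>
bool process_in_chunks(OffsetIt d_begin_offsets, std::int64_t num_segments, std::int64_t max_chunk, LaunchFn launch)
{
  // Run-time guard: without it, a non-advanceable iterator would silently
  // reprocess the first chunk on every iteration of the loop below.
  if constexpr (!can_advance<OffsetIt>::value)
  {
    if (num_segments > max_chunk)
    {
      return false;
    }
  }

  for (std::int64_t done = 0; done < num_segments;)
  {
    const std::int64_t current = std::min(max_chunk, num_segments - done);
    launch(d_begin_offsets, current);
    done += current;
    // Compile-time conditional: the increment is omitted for opaque types.
    if constexpr (can_advance<OffsetIt>::value)
    {
      std::advance(d_begin_offsets, current);
    }
  }
  return true;
}
} // namespace streaming_sketch
```

The iterator is taken by value here, so the caller's copy is never modified.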

@oleksandr-pavlyk
Contributor Author

The only downside (consequence) of using ::cuda::std::advance would be that the host-incrementable indirect iterator needed in gh-4148 would have to provide all iterator traits, rather than just the addition operators.
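
To make that trade-off concrete, here is a small host-only sketch using `std::advance` and `std::iterator_traits` as stand-ins for their `::cuda::std` counterparts. Both handle types and the `has_iterator_category` trait are hypothetical and only illustrate the difference between "just `operator+=`" and "usable with `advance`".

```cpp
#include <cstddef>
#include <iterator>
#include <type_traits>

namespace traits_sketch
{
// Hypothetical host-incrementable handle that offers only operator+=.
// std::iterator_traits has no members for it, so std::advance cannot use it.
struct bare_handle
{
  std::ptrdiff_t pos = 0;
  bare_handle& operator+=(std::ptrdiff_t n)
  {
    pos += n;
    return *this;
  }
};

// The same handle with the member types std::iterator_traits looks for, plus
// ++ and --, which some std::advance implementations also instantiate.
struct full_handle
{
  using iterator_category = std::random_access_iterator_tag;
  using value_type        = std::ptrdiff_t;
  using difference_type   = std::ptrdiff_t;
  using pointer           = const std::ptrdiff_t*;
  using reference         = std::ptrdiff_t;

  std::ptrdiff_t pos = 0;
  full_handle& operator+=(difference_type n)
  {
    pos += n;
    return *this;
  }
  full_handle& operator++()
  {
    ++pos;
    return *this;
  }
  full_handle& operator--()
  {
    --pos;
    return *this;
  }
};

// Detects whether std::iterator_traits<T> exposes an iterator_category.
template <typename T, typename = void>
struct has_iterator_category : std::false_type
{};

template <typename T>
struct has_iterator_category<T, std::void_t<typename std::iterator_traits<T>::iterator_category>>
    : std::true_type
{};
} // namespace traits_sketch
```

Only `full_handle` can be passed to `std::advance`; `bare_handle` would have to go through a helper that calls `+=` directly.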

@miscco
Contributor

miscco commented Apr 14, 2025

indirect_arg_t

Why don't we implement += for indirect_arg_t?

@oleksandr-pavlyk
Contributor Author

Why don't we implement += for indirect_arg_t?

Because indirect_arg_t deals with type-erased iterator state that is user-defined in Python. All C++ knows is:

struct indirect_arg_t
{
  void* ptr;

  indirect_arg_t(cccl_iterator_t& it)
      : ptr(it.type == cccl_iterator_kind_t::CCCL_POINTER ? &it.state : it.state)
  {}

  void* operator&() const
  {
    return ptr;
  }
};

This is because InvokePass may be called multiple times by InvokePasses due
to the algorithmic nature of radix sorting. With this change, InvokePass creates
local copies of `d_begin_offsets` and `d_end_offsets` and advances these copies
in-place if necessary.
@oleksandr-pavlyk
Contributor Author

@elstehle identified the reason for the CI failures. Switching DispatchSegmentedRadixSort::InvokePass to advance iterators in place mutated class state, and since InvokePass may be called repeatedly by InvokePasses for wider data types, this caused out-of-bounds accesses.

The solution is for DispatchSegmentedRadixSort::InvokePass to create private copies of the offset iterators and advance those in place, leaving the DispatchSegmentedRadixSort state unmodified.
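
A toy version of the fix, assuming nothing about the real DispatchSegmentedRadixSort beyond what is said above: `segmented_dispatch`, its members, and `invoke_pass` are hypothetical names, raw pointers stand in for the offset iterators, and recording the first begin offset of each chunk stands in for launching a kernel.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

namespace pass_sketch
{
struct segmented_dispatch
{
  const int* d_begin_offsets; // state shared by all passes
  const int* d_end_offsets;
  std::int64_t num_segments;
  std::int64_t max_chunk;

  // One pass over all segments, processed in chunks. The members are copied
  // into locals first; advancing the members themselves would leave them
  // pointing past the data when the next pass starts.
  std::vector<int> invoke_pass() const
  {
    const int* begin_it = d_begin_offsets;
    const int* end_it   = d_end_offsets;
    std::vector<int> first_offsets;
    for (std::int64_t done = 0; done < num_segments;)
    {
      const std::int64_t current = std::min(max_chunk, num_segments - done);
      first_offsets.push_back(*begin_it); // stand-in for a kernel launch
      begin_it += current; // in-place advance of the local copies only
      end_it += current;
      done += current;
    }
    return first_offsets;
  }
};
} // namespace pass_sketch
```

Because `invoke_pass` is `const` and works on local copies, calling it repeatedly yields the same chunk sequence every time.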

Contributor

🟩 CI finished in 2h 06m: Pass: 100%/103 | Total: 2d 23h | Avg: 41m 38s | Max: 1h 30m | Hits: 76%/140524
  • 🟩 cub: Pass: 100%/47 | Total: 1d 22h | Avg: 58m 48s | Max: 1h 30m | Hits: 68%/56733

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total:  1d 19h | Avg: 58m 31s | Max:  1h 30m | Hits:  68%/54267 
      🟩 arm64              Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 08m | Hits:  61%/2466  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  5h 26m | Avg:  1h 05m | Max:  1h 09m | Hits:  61%/5994  
      🟩 12.8               Pass: 100%/42  | Total:  1d 16h | Avg: 58m 02s | Max:  1h 30m | Hits:  69%/50739 
    🟩 cudacxx
      🟩 ClangCUDA19        Pass: 100%/2   | Total:  1h 55m | Avg: 57m 41s | Max: 57m 44s | Hits:  66%/2128  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  5h 26m | Avg:  1h 05m | Max:  1h 09m | Hits:  61%/5994  
      🟩 nvcc12.8           Pass: 100%/40  | Total:  1d 14h | Avg: 58m 03s | Max:  1h 30m | Hits:  69%/48611 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 55m | Avg: 57m 41s | Max: 57m 44s | Hits:  66%/2128  
      🟩 nvcc               Pass: 100%/45  | Total:  1d 20h | Avg: 58m 51s | Max:  1h 30m | Hits:  68%/54605 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  4h 11m | Avg:  1h 02m | Max:  1h 06m | Hits:  61%/4940  
      🟩 Clang15            Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 01m | Hits:  61%/2466  
      🟩 Clang16            Pass: 100%/2   | Total:  2h 11m | Avg:  1h 05m | Max:  1h 10m | Hits:  61%/2466  
      🟩 Clang17            Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 03m | Hits:  61%/2466  
      🟩 Clang18            Pass: 100%/2   | Total:  2h 11m | Avg:  1h 05m | Max:  1h 08m | Hits:  61%/2466  
      🟩 Clang19            Pass: 100%/7   | Total:  5h 55m | Avg: 50m 47s | Max:  1h 06m | Hits:  74%/8293  
      🟩 GCC7               Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 04m | Hits:  60%/2470  
      🟩 GCC8               Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m | Hits:  60%/1235  
      🟩 GCC9               Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 07m | Hits:  60%/2470  
      🟩 GCC10              Pass: 100%/2   | Total:  2h 15m | Avg:  1h 07m | Max:  1h 11m | Hits:  60%/2470  
      🟩 GCC11              Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 03m | Hits:  60%/2466  
      🟩 GCC12              Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 15m | Hits:  60%/2466  
      🟩 GCC13              Pass: 100%/11  | Total:  7h 38m | Avg: 41m 42s | Max:  1h 12m | Hits:  82%/13563 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 39m | Avg:  1h 19m | Max:  1h 30m | Hits:  65%/2108  
      🟩 MSVC14.42          Pass: 100%/2   | Total:  2h 39m | Avg:  1h 19m | Max:  1h 22m | Hits:  65%/2108  
      🟩 NVHPC25.3          Pass: 100%/2   | Total:  2h 26m | Avg:  1h 13m | Max:  1h 13m | Hits:  60%/2280  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total: 18h 38m | Avg: 58m 53s | Max:  1h 10m | Hits:  65%/23097 
      🟩 GCC                Pass: 100%/22  | Total: 19h 40m | Avg: 53m 39s | Max:  1h 15m | Hits:  71%/27140 
      🟩 MSVC               Pass: 100%/4   | Total:  5h 18m | Avg:  1h 19m | Max:  1h 30m | Hits:  65%/4216  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 26m | Avg:  1h 13m | Max:  1h 13m | Hits:  60%/2280  
    🟩 gpu
      🟩 h100               Pass: 100%/3   | Total:  1h 18m | Avg: 26m 00s | Max: 26m 35s | Hits:  86%/3699  
      🟩 rtx2080            Pass: 100%/36  | Total:  1d 15h | Avg:  1h 06m | Max:  1h 30m | Hits:  61%/43170 
      🟩 rtxa6000           Pass: 100%/8   | Total:  4h 53m | Avg: 36m 41s | Max:  1h 06m | Hits:  90%/9864  
    🟩 jobs
      🟩 Build              Pass: 100%/39  | Total:  1d 18h | Avg:  1h 05m | Max:  1h 30m | Hits:  61%/46869 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 30m 56s | Avg: 30m 56s | Max: 30m 56s | Hits:  99%/1233  
      🟩 GraphCapture       Pass: 100%/1   | Total: 23m 48s | Avg: 23m 48s | Max: 23m 48s | Hits:  99%/1233  
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 24m | Avg: 28m 12s | Max: 30m 48s | Hits:  99%/3699  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 14m | Avg: 24m 43s | Max: 27m 10s | Hits:  99%/3699  
    🟩 sm
      🟩 90                 Pass: 100%/3   | Total:  1h 18m | Avg: 26m 00s | Max: 26m 35s | Hits:  86%/3699  
      🟩 90;90a;100         Pass: 100%/1   | Total:  1h 12m | Avg:  1h 12m | Max:  1h 12m | Hits:  60%/1233  
    🟩 std
      🟩 17                 Pass: 100%/21  | Total: 23h 16m | Avg:  1h 06m | Max:  1h 30m | Hits:  61%/25110 
      🟩 20                 Pass: 100%/26  | Total: 22h 47m | Avg: 52m 36s | Max:  1h 22m | Hits:  73%/31623 
    
  • 🟩 thrust: Pass: 100%/47 | Total: 1d 00h | Avg: 30m 54s | Max: 1h 09m | Hits: 81%/83463

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 37m 41s | Avg: 18m 50s | Max: 26m 14s | Hits:  89%/3554  
    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total: 23h 18m | Avg: 31m 04s | Max:  1h 09m | Hits:  81%/79910 
      🟩 arm64              Pass: 100%/2   | Total: 54m 17s | Avg: 27m 08s | Max: 28m 38s | Hits:  79%/3553  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  2h 53m | Avg: 34m 42s | Max: 52m 56s | Hits:  78%/8876  
      🟩 12.8               Pass: 100%/42  | Total: 21h 18m | Avg: 30m 26s | Max:  1h 09m | Hits:  81%/74587 
    🟩 cudacxx
      🟩 ClangCUDA19        Pass: 100%/2   | Total: 51m 12s | Avg: 25m 36s | Max: 27m 09s | Hits:  79%/3552  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  2h 53m | Avg: 34m 42s | Max: 52m 56s | Hits:  78%/8876  
      🟩 nvcc12.8           Pass: 100%/40  | Total: 20h 27m | Avg: 30m 41s | Max:  1h 09m | Hits:  82%/71035 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 51m 12s | Avg: 25m 36s | Max: 27m 09s | Hits:  79%/3552  
      🟩 nvcc               Pass: 100%/45  | Total: 23h 21m | Avg: 31m 08s | Max:  1h 09m | Hits:  81%/79911 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  1h 53m | Avg: 28m 23s | Max: 31m 20s | Hits:  79%/7104  
      🟩 Clang15            Pass: 100%/2   | Total: 57m 14s | Avg: 28m 37s | Max: 29m 54s | Hits:  79%/3552  
      🟩 Clang16            Pass: 100%/2   | Total:  1h 04m | Avg: 32m 11s | Max: 34m 48s | Hits:  79%/3552  
      🟩 Clang17            Pass: 100%/2   | Total:  1h 00m | Avg: 30m 06s | Max: 31m 18s | Hits:  79%/3552  
      🟩 Clang18            Pass: 100%/2   | Total:  1h 02m | Avg: 31m 18s | Max: 35m 53s | Hits:  79%/3552  
      🟩 Clang19            Pass: 100%/7   | Total:  2h 37m | Avg: 22m 28s | Max: 34m 10s | Hits:  85%/12432 
      🟩 GCC7               Pass: 100%/2   | Total:  1h 02m | Avg: 31m 18s | Max: 33m 02s | Hits:  79%/3554  
      🟩 GCC8               Pass: 100%/1   | Total: 29m 18s | Avg: 29m 18s | Max: 29m 18s | Hits:  79%/1777  
      🟩 GCC9               Pass: 100%/2   | Total:  1h 07m | Avg: 33m 41s | Max: 36m 19s | Hits:  79%/3554  
      🟩 GCC10              Pass: 100%/2   | Total:  1h 01m | Avg: 30m 34s | Max: 31m 12s | Hits:  79%/3554  
      🟩 GCC11              Pass: 100%/2   | Total:  1h 00m | Avg: 30m 07s | Max: 30m 31s | Hits:  79%/3554  
      🟩 GCC12              Pass: 100%/2   | Total:  1h 04m | Avg: 32m 11s | Max: 34m 15s | Hits:  79%/3554  
      🟩 GCC13              Pass: 100%/10  | Total:  3h 34m | Avg: 21m 29s | Max: 34m 53s | Hits:  87%/17770 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 49m | Avg: 54m 55s | Max: 56m 55s | Hits:  73%/3540  
      🟩 MSVC14.42          Pass: 100%/3   | Total:  2h 20m | Avg: 46m 41s | Max: 56m 31s | Hits:  81%/5310  
      🟩 NVHPC25.3          Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 09m | Hits:  73%/3552  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  8h 35m | Avg: 27m 07s | Max: 35m 53s | Hits:  81%/33744 
      🟩 GCC                Pass: 100%/21  | Total:  9h 19m | Avg: 26m 39s | Max: 36m 19s | Hits:  83%/37317 
      🟩 MSVC               Pass: 100%/5   | Total:  4h 09m | Avg: 49m 59s | Max: 56m 55s | Hits:  78%/8850  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 09m | Hits:  73%/3552  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 28m 46s | Avg: 14m 23s | Max: 16m 48s | Hits:  89%/3554  
      🟩 rtx2080            Pass: 100%/35  | Total: 19h 54m | Avg: 34m 08s | Max:  1h 09m | Hits:  78%/62156 
      🟩 rtx4090            Pass: 100%/10  | Total:  3h 48m | Avg: 22m 51s | Max: 54m 49s | Hits:  90%/17753 
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total: 22h 41m | Avg: 34m 02s | Max:  1h 09m | Hits:  78%/71033 
      🟩 TestCPU            Pass: 100%/3   | Total: 45m 19s | Avg: 15m 06s | Max: 28m 45s | Hits:  99%/5323  
      🟩 TestGPU            Pass: 100%/4   | Total: 45m 17s | Avg: 11m 19s | Max: 11m 58s | Hits:  99%/7107  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 28m 46s | Avg: 14m 23s | Max: 16m 48s | Hits:  89%/3554  
      🟩 90;90a;100         Pass: 100%/1   | Total: 30m 25s | Avg: 30m 25s | Max: 30m 25s | Hits:  79%/1777  
    🟩 std
      🟩 17                 Pass: 100%/21  | Total: 12h 50m | Avg: 36m 40s | Max:  1h 09m | Hits:  78%/37287 
      🟩 20                 Pass: 100%/24  | Total: 10h 44m | Avg: 26m 51s | Max: 57m 14s | Hits:  83%/42622 
    
  • 🟩 stdpar: Pass: 100%/4 | Total: 18m 53s | Avg: 4m 43s | Max: 5m 24s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  5m 24s
      🟩 arm64              Pass: 100%/2   | Total:  8m 15s | Avg:  4m 07s | Max:  4m 13s
    🟩 ctk
      🟩 12.8               Pass: 100%/4   | Total: 18m 53s | Avg:  4m 43s | Max:  5m 24s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/4   | Total: 18m 53s | Avg:  4m 43s | Max:  5m 24s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 18m 53s | Avg:  4m 43s | Max:  5m 24s
    🟩 cxx
      🟩 NVHPC25.3          Pass: 100%/4   | Total: 18m 53s | Avg:  4m 43s | Max:  5m 24s
    🟩 cxx_family
      🟩 NVHPC              Pass: 100%/4   | Total: 18m 53s | Avg:  4m 43s | Max:  5m 24s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 18m 53s | Avg:  4m 43s | Max:  5m 24s
    🟩 jobs
      🟩 Build              Pass: 100%/4   | Total: 18m 53s | Avg:  4m 43s | Max:  5m 24s
    🟩 std
      🟩 17                 Pass: 100%/2   | Total:  9m 16s | Avg:  4m 38s | Max:  5m 14s
      🟩 20                 Pass: 100%/2   | Total:  9m 37s | Avg:  4m 48s | Max:  5m 24s
    
  • 🟩 python: Pass: 100%/3 | Total: 29m 59s | Avg: 9m 59s | Max: 20m 33s

    🟩 cpu
      🟩 amd64              Pass: 100%/3   | Total: 29m 59s | Avg:  9m 59s | Max: 20m 33s
    🟩 ctk
      🟩 12.8               Pass: 100%/3   | Total: 29m 59s | Avg:  9m 59s | Max: 20m 33s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/3   | Total: 29m 59s | Avg:  9m 59s | Max: 20m 33s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/3   | Total: 29m 59s | Avg:  9m 59s | Max: 20m 33s
    🟩 cxx
      🟩 GCC13              Pass: 100%/3   | Total: 29m 59s | Avg:  9m 59s | Max: 20m 33s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/3   | Total: 29m 59s | Avg:  9m 59s | Max: 20m 33s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/3   | Total: 29m 59s | Avg:  9m 59s | Max: 20m 33s
    🟩 jobs
      🟩 cuda.cccl          Pass: 100%/1   | Total:  2m 58s | Avg:  2m 58s | Max:  2m 58s
      🟩 cuda.cooperative   Pass: 100%/1   | Total: 20m 33s | Avg: 20m 33s | Max: 20m 33s
      🟩 cuda.parallel      Pass: 100%/1   | Total:  6m 28s | Avg:  6m 28s | Max:  6m 28s
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 23m 52s | Avg: 11m 56s | Max: 21m 14s | Hits: 97%/328

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 23m 52s | Avg: 11m 56s | Max: 21m 14s | Hits:  97%/328   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total: 23m 52s | Avg: 11m 56s | Max: 21m 14s | Hits:  97%/328   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total: 23m 52s | Avg: 11m 56s | Max: 21m 14s | Hits:  97%/328   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 23m 52s | Avg: 11m 56s | Max: 21m 14s | Hits:  97%/328   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 23m 52s | Avg: 11m 56s | Max: 21m 14s | Hits:  97%/328   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 23m 52s | Avg: 11m 56s | Max: 21m 14s | Hits:  97%/328   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 23m 52s | Avg: 11m 56s | Max: 21m 14s | Hits:  97%/328   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 38s | Avg:  2m 38s | Max:  2m 38s | Hits:  96%/164   
      🟩 Test               Pass: 100%/1   | Total: 21m 14s | Avg: 21m 14s | Max: 21m 14s | Hits:  98%/164   
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
stdpar
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- stdpar
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 103)

# Runner
72 linux-amd64-cpu16
9 windows-amd64-cpu16
6 linux-arm64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-amd64-gpu-rtx2080-latest-1
3 linux-amd64-gpu-h100-latest-1
3 linux-amd64-gpu-rtx4090-latest-1


🟩 CI finished in 1h 38m: Pass: 100%/103 | Total: 1d 02h | Avg: 15m 13s | Max: 1h 19m | Hits: 96%/140524
  • 🟩 cub: Pass: 100%/47 | Total: 14h 36m | Avg: 18m 39s | Max: 1h 19m | Hits: 96%/56733

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total: 14h 25m | Avg: 19m 13s | Max:  1h 19m | Hits:  96%/54267 
      🟩 arm64              Pass: 100%/2   | Total: 11m 46s | Avg:  5m 53s | Max:  6m 17s | Hits:  99%/2466  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 39m | Avg: 19m 57s | Max:  1h 15m | Hits:  95%/5994  
      🟩 12.8               Pass: 100%/42  | Total: 12h 57m | Avg: 18m 30s | Max:  1h 19m | Hits:  96%/50739 
    🟩 cudacxx
      🟩 ClangCUDA19        Pass: 100%/2   | Total: 10m 20s | Avg:  5m 10s | Max:  5m 17s | Hits: 100%/2128  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 39m | Avg: 19m 57s | Max:  1h 15m | Hits:  95%/5994  
      🟩 nvcc12.8           Pass: 100%/40  | Total: 12h 46m | Avg: 19m 10s | Max:  1h 19m | Hits:  96%/48611 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 20s | Avg:  5m 10s | Max:  5m 17s | Hits: 100%/2128  
      🟩 nvcc               Pass: 100%/45  | Total: 14h 26m | Avg: 19m 15s | Max:  1h 19m | Hits:  96%/54605 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 23m 49s | Avg:  5m 57s | Max:  6m 20s | Hits: 100%/4940  
      🟩 Clang15            Pass: 100%/2   | Total: 13m 15s | Avg:  6m 37s | Max:  6m 42s | Hits: 100%/2466  
      🟩 Clang16            Pass: 100%/2   | Total: 13m 29s | Avg:  6m 44s | Max:  6m 46s | Hits:  99%/2466  
      🟩 Clang17            Pass: 100%/2   | Total: 12m 53s | Avg:  6m 26s | Max:  6m 36s | Hits: 100%/2466  
      🟩 Clang18            Pass: 100%/2   | Total: 12m 49s | Avg:  6m 24s | Max:  6m 29s | Hits: 100%/2466  
      🟩 Clang19            Pass: 100%/7   | Total:  1h 18m | Avg: 11m 16s | Max: 25m 40s | Hits: 100%/8293  
      🟩 GCC7               Pass: 100%/2   | Total: 13m 28s | Avg:  6m 44s | Max:  7m 10s | Hits:  99%/2470  
      🟩 GCC8               Pass: 100%/1   | Total:  6m 39s | Avg:  6m 39s | Max:  6m 39s | Hits:  99%/1235  
      🟩 GCC9               Pass: 100%/2   | Total: 13m 41s | Avg:  6m 50s | Max:  7m 18s | Hits:  99%/2470  
      🟩 GCC10              Pass: 100%/2   | Total: 13m 22s | Avg:  6m 41s | Max:  6m 55s | Hits:  99%/2470  
      🟩 GCC11              Pass: 100%/2   | Total: 14m 10s | Avg:  7m 05s | Max:  7m 23s | Hits:  99%/2466  
      🟩 GCC12              Pass: 100%/2   | Total: 14m 20s | Avg:  7m 10s | Max:  7m 12s | Hits:  99%/2466  
      🟩 GCC13              Pass: 100%/11  | Total:  3h 07m | Avg: 17m 02s | Max: 30m 09s | Hits:  99%/13563 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 35m | Avg:  1h 17m | Max:  1h 19m | Hits:  73%/2108  
      🟩 MSVC14.42          Pass: 100%/2   | Total:  2h 33m | Avg:  1h 16m | Max:  1h 16m | Hits:  73%/2108  
      🟩 NVHPC25.3          Pass: 100%/2   | Total:  2h 29m | Avg:  1h 14m | Max:  1h 19m | Hits:  66%/2280  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  2h 35m | Avg:  8m 10s | Max: 25m 40s | Hits:  99%/23097 
      🟩 GCC                Pass: 100%/22  | Total:  4h 23m | Avg: 11m 57s | Max: 30m 09s | Hits:  99%/27140 
      🟩 MSVC               Pass: 100%/4   | Total:  5h 08m | Avg:  1h 17m | Max:  1h 19m | Hits:  73%/4216  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 29m | Avg:  1h 14m | Max:  1h 19m | Hits:  66%/2280  
    🟩 gpu
      🟩 h100               Pass: 100%/3   | Total: 54m 33s | Avg: 18m 11s | Max: 27m 28s | Hits:  99%/3699  
      🟩 rtx2080            Pass: 100%/36  | Total: 10h 54m | Avg: 18m 10s | Max:  1h 19m | Hits:  95%/43170 
      🟩 rtxa6000           Pass: 100%/8   | Total:  2h 48m | Avg: 21m 02s | Max: 30m 09s | Hits:  99%/9864  
    🟩 jobs
      🟩 Build              Pass: 100%/39  | Total: 11h 13m | Avg: 17m 16s | Max:  1h 19m | Hits:  95%/46869 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 28m 49s | Avg: 28m 49s | Max: 28m 49s | Hits:  99%/1233  
      🟩 GraphCapture       Pass: 100%/1   | Total: 23m 20s | Avg: 23m 20s | Max: 23m 20s | Hits:  99%/1233  
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 23m | Avg: 27m 45s | Max: 30m 09s | Hits:  99%/3699  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 08m | Avg: 22m 41s | Max: 24m 05s | Hits:  99%/3699  
    🟩 sm
      🟩 90                 Pass: 100%/3   | Total: 54m 33s | Avg: 18m 11s | Max: 27m 28s | Hits:  99%/3699  
      🟩 90;90a;100         Pass: 100%/1   | Total:  7m 35s | Avg:  7m 35s | Max:  7m 35s | Hits:  99%/1233  
    🟩 std
      🟩 17                 Pass: 100%/21  | Total:  7h 03m | Avg: 20m 09s | Max:  1h 19m | Hits:  94%/25110 
      🟩 20                 Pass: 100%/26  | Total:  7h 33m | Avg: 17m 26s | Max:  1h 16m | Hits:  97%/31623 
    
  • 🟩 thrust: Pass: 100%/47 | Total: 10h 16m | Avg: 13m 07s | Max: 1h 03m | Hits: 95%/83463

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 18m 33s | Avg:  9m 16s | Max: 11m 29s | Hits:  99%/3554  
    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total: 10h 06m | Avg: 13m 28s | Max:  1h 03m | Hits:  95%/79910 
      🟩 arm64              Pass: 100%/2   | Total: 10m 08s | Avg:  5m 04s | Max:  5m 22s | Hits:  99%/3553  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 08m | Avg: 13m 43s | Max: 49m 03s | Hits:  98%/8876  
      🟩 12.8               Pass: 100%/42  | Total:  9h 08m | Avg: 13m 03s | Max:  1h 03m | Hits:  95%/74587 
    🟩 cudacxx
      🟩 ClangCUDA19        Pass: 100%/2   | Total: 10m 24s | Avg:  5m 12s | Max:  5m 20s | Hits: 100%/3552  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 08m | Avg: 13m 43s | Max: 49m 03s | Hits:  98%/8876  
      🟩 nvcc12.8           Pass: 100%/40  | Total:  8h 57m | Avg: 13m 26s | Max:  1h 03m | Hits:  95%/71035 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 24s | Avg:  5m 12s | Max:  5m 20s | Hits: 100%/3552  
      🟩 nvcc               Pass: 100%/45  | Total: 10h 06m | Avg: 13m 28s | Max:  1h 03m | Hits:  95%/79911 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 20m 58s | Avg:  5m 14s | Max:  5m 44s | Hits: 100%/7104  
      🟩 Clang15            Pass: 100%/2   | Total: 11m 50s | Avg:  5m 55s | Max:  5m 57s | Hits: 100%/3552  
      🟩 Clang16            Pass: 100%/2   | Total: 11m 22s | Avg:  5m 41s | Max:  5m 53s | Hits: 100%/3552  
      🟩 Clang17            Pass: 100%/2   | Total: 12m 15s | Avg:  6m 07s | Max:  6m 21s | Hits: 100%/3552  
      🟩 Clang18            Pass: 100%/2   | Total: 11m 26s | Avg:  5m 43s | Max:  6m 06s | Hits: 100%/3552  
      🟩 Clang19            Pass: 100%/7   | Total: 44m 19s | Avg:  6m 19s | Max: 10m 22s | Hits: 100%/12432 
      🟩 GCC7               Pass: 100%/2   | Total: 10m 30s | Avg:  5m 15s | Max:  5m 38s | Hits:  99%/3554  
      🟩 GCC8               Pass: 100%/1   | Total:  5m 53s | Avg:  5m 53s | Max:  5m 53s | Hits:  99%/1777  
      🟩 GCC9               Pass: 100%/2   | Total: 11m 49s | Avg:  5m 54s | Max:  6m 53s | Hits:  99%/3554  
      🟩 GCC10              Pass: 100%/2   | Total: 11m 57s | Avg:  5m 58s | Max:  6m 06s | Hits:  99%/3554  
      🟩 GCC11              Pass: 100%/2   | Total: 11m 44s | Avg:  5m 52s | Max:  6m 06s | Hits:  99%/3554  
      🟩 GCC12              Pass: 100%/2   | Total: 12m 06s | Avg:  6m 03s | Max:  6m 09s | Hits:  99%/3554  
      🟩 GCC13              Pass: 100%/10  | Total:  1h 17m | Avg:  7m 46s | Max: 11m 35s | Hits:  99%/17770 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 38m | Avg: 49m 13s | Max: 49m 24s | Hits:  90%/3540  
      🟩 MSVC14.42          Pass: 100%/3   | Total:  2h 22m | Avg: 47m 37s | Max: 48m 05s | Hits:  60%/5310  
      🟩 NVHPC25.3          Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 03m | Hits:  74%/3552  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  1h 52m | Avg:  5m 54s | Max: 10m 22s | Hits: 100%/33744 
      🟩 GCC                Pass: 100%/21  | Total:  2h 21m | Avg:  6m 44s | Max: 11m 35s | Hits:  99%/37317 
      🟩 MSVC               Pass: 100%/5   | Total:  4h 01m | Avg: 48m 16s | Max: 49m 24s | Hits:  72%/8850  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 03m | Hits:  74%/3552  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 16m 01s | Avg:  8m 00s | Max: 10m 54s | Hits:  99%/3554  
      🟩 rtx2080            Pass: 100%/35  | Total:  7h 17m | Avg: 12m 30s | Max:  1h 03m | Hits:  97%/62156 
      🟩 rtx4090            Pass: 100%/10  | Total:  2h 42m | Avg: 16m 16s | Max: 48m 05s | Hits:  89%/17753 
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  8h 28m | Avg: 12m 42s | Max:  1h 03m | Hits:  97%/71033 
      🟩 TestCPU            Pass: 100%/3   | Total:  1h 03m | Avg: 21m 14s | Max: 48m 05s | Hits:  66%/5323  
      🟩 TestGPU            Pass: 100%/4   | Total: 44m 20s | Avg: 11m 05s | Max: 11m 35s | Hits:  99%/7107  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 16m 01s | Avg:  8m 00s | Max: 10m 54s | Hits:  99%/3554  
      🟩 90;90a;100         Pass: 100%/1   | Total:  5m 56s | Avg:  5m 56s | Max:  5m 56s | Hits:  99%/1777  
    🟩 std
      🟩 17                 Pass: 100%/21  | Total:  5h 07m | Avg: 14m 38s | Max:  1h 03m | Hits:  97%/37287 
      🟩 20                 Pass: 100%/24  | Total:  4h 50m | Avg: 12m 06s | Max: 58m 24s | Hits:  94%/42622 
    
  • 🟩 stdpar: Pass: 100%/4 | Total: 19m 41s | Avg: 4m 55s | Max: 5m 39s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 11m 12s | Avg:  5m 36s | Max:  5m 39s
      🟩 arm64              Pass: 100%/2   | Total:  8m 29s | Avg:  4m 14s | Max:  4m 17s
    🟩 ctk
      🟩 12.8               Pass: 100%/4   | Total: 19m 41s | Avg:  4m 55s | Max:  5m 39s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/4   | Total: 19m 41s | Avg:  4m 55s | Max:  5m 39s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 19m 41s | Avg:  4m 55s | Max:  5m 39s
    🟩 cxx
      🟩 NVHPC25.3          Pass: 100%/4   | Total: 19m 41s | Avg:  4m 55s | Max:  5m 39s
    🟩 cxx_family
      🟩 NVHPC              Pass: 100%/4   | Total: 19m 41s | Avg:  4m 55s | Max:  5m 39s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 19m 41s | Avg:  4m 55s | Max:  5m 39s
    🟩 jobs
      🟩 Build              Pass: 100%/4   | Total: 19m 41s | Avg:  4m 55s | Max:  5m 39s
    🟩 std
      🟩 17                 Pass: 100%/2   | Total:  9m 51s | Avg:  4m 55s | Max:  5m 39s
      🟩 20                 Pass: 100%/2   | Total:  9m 50s | Avg:  4m 55s | Max:  5m 33s
    
  • 🟩 python: Pass: 100%/3 | Total: 29m 39s | Avg: 9m 53s | Max: 21m 04s

    🟩 cpu
      🟩 amd64              Pass: 100%/3   | Total: 29m 39s | Avg:  9m 53s | Max: 21m 04s
    🟩 ctk
      🟩 12.8               Pass: 100%/3   | Total: 29m 39s | Avg:  9m 53s | Max: 21m 04s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/3   | Total: 29m 39s | Avg:  9m 53s | Max: 21m 04s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/3   | Total: 29m 39s | Avg:  9m 53s | Max: 21m 04s
    🟩 cxx
      🟩 GCC13              Pass: 100%/3   | Total: 29m 39s | Avg:  9m 53s | Max: 21m 04s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/3   | Total: 29m 39s | Avg:  9m 53s | Max: 21m 04s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/3   | Total: 29m 39s | Avg:  9m 53s | Max: 21m 04s
    🟩 jobs
      🟩 cuda.cccl          Pass: 100%/1   | Total:  2m 54s | Avg:  2m 54s | Max:  2m 54s
      🟩 cuda.cooperative   Pass: 100%/1   | Total: 21m 04s | Avg: 21m 04s | Max: 21m 04s
      🟩 cuda.parallel      Pass: 100%/1   | Total:  5m 41s | Avg:  5m 41s | Max:  5m 41s
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 24m 55s | Avg: 12m 27s | Max: 22m 38s | Hits: 98%/328

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 24m 55s | Avg: 12m 27s | Max: 22m 38s | Hits:  98%/328   
    🟩 ctk
      🟩 12.8               Pass: 100%/2   | Total: 24m 55s | Avg: 12m 27s | Max: 22m 38s | Hits:  98%/328   
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/2   | Total: 24m 55s | Avg: 12m 27s | Max: 22m 38s | Hits:  98%/328   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 24m 55s | Avg: 12m 27s | Max: 22m 38s | Hits:  98%/328   
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 24m 55s | Avg: 12m 27s | Max: 22m 38s | Hits:  98%/328   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 24m 55s | Avg: 12m 27s | Max: 22m 38s | Hits:  98%/328   
    🟩 gpu
      🟩 rtx2080            Pass: 100%/2   | Total: 24m 55s | Avg: 12m 27s | Max: 22m 38s | Hits:  98%/328   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 17s | Avg:  2m 17s | Max:  2m 17s | Hits:  98%/164   
      🟩 Test               Pass: 100%/1   | Total: 22m 38s | Avg: 22m 38s | Max: 22m 38s | Hits:  98%/164   
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
stdpar
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- stdpar
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 103)

# Runner
72 linux-amd64-cpu16
9 windows-amd64-cpu16
6 linux-arm64-cpu16
6 linux-amd64-gpu-rtxa6000-latest-1
4 linux-amd64-gpu-rtx2080-latest-1
3 linux-amd64-gpu-h100-latest-1
3 linux-amd64-gpu-rtx4090-latest-1
