[FEA]: Introduce Python module with CCCL headers #3201

rwgk · 2024-12-19T19:24:52Z

Description

Factor out cuda-cccl pip-installable package (cccl/python/cuda_cccl) with CCCL headers from cuda-cooperative and cuda-parallel.
- Fixes a bug: pip install works as expected in one pass. Resolves this.
Factor out cuda.cccl.include_paths from cuda.cooperative.experimental._nvrtc and cuda.parallel.experimental._bindings.
Comprehensive pyproject.toml, setup.py modernization in cuda_cccl, cuda_cooperative, cuda_parallel.
os.path -> pathlib modernization in all .py files touched by this PR.
Revise ci/test_python.sh so that cuda_cooperative and cuda_parallel testing is completely independent (they now run in fresh virtual environments).
Add a pre-commit hook for TOML formatting (taplo-pre-commit). TOML lint worked interactively but not in the CCCL CI (see [FEA]: Enable taplo-lint pre-commit #3426).

Currently cuda-cccl is not published on PyPI. For interactive development, this is the recommended workflow:

cd python/cuda_cooperative  # or cuda_parallel
pip install --editable ../cuda_cccl
pip install --editable .[test]
pytest -v ./tests/

I.e. by installing cuda-cccl first, the dependency is satisfied when running pip install in cuda_cooperative or cuda_parallel.

However, CI testing (ci/test_python.sh) uses an alternative approach, to ensure that we're not accidentally removing cuda-cccl from the dependencies in {cuda_cooperative,cuda_parallel}/pyproject.toml. Concretely, pip install is only run once, using the pip install --constraint option:

echo 'cuda-cccl @ file:///home/coder/cccl/python/cuda_cccl' > /tmp/cuda-cccl_constraints.txt
pip install --constraint /tmp/cuda-cccl_constraints.txt .[test]

This will fail if cuda-cccl is missing in the dependencies.
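
For illustration, the constraints line above could be generated from Python rather than hard-coded. This is a minimal sketch, not what ci/test_python.sh does: the helper name write_cccl_constraint and its parameters are hypothetical, and it relies only on pathlib.Path.as_uri() to build the file:// URL.

```python
from pathlib import Path


def write_cccl_constraint(cuda_cccl_dir: Path, constraints_file: Path) -> str:
    """Write a one-line pip constraints file pointing cuda-cccl at a local directory.

    Hypothetical helper for illustration only.
    """
    # as_uri() requires an absolute path, so resolve the directory first.
    line = f"cuda-cccl @ {cuda_cccl_dir.resolve().as_uri()}"
    constraints_file.write_text(line + "\n")
    return line
```

The resulting file would then be passed to pip install --constraint, as in the commands above.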

Note for completeness:

I spent a significant amount of time trying to use hatchling as the build backend (instead of setuptools):

61637d6

With that commit, pip install worked, but pip install --editable did not. The root cause is this file installed by cuda-python:

https://github.com/NVIDIA/cuda-python/blob/54a93e677b9ba5980dff24c649aa61e004f64416/cuda_bindings/cuda/__init__.py

This file interferes with Python's Implicit Namespace Packages mechanism, which is what hatchling relies on in --editable mode.
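
The general mechanism can be reproduced in isolation. The following is a minimal sketch that assumes nothing about hatchling's internals: the package name demo_ns and the directory names are hypothetical stand-ins for cuda and for two separate sys.path entries (for example, site-packages and an editable source tree). It shows that an __init__.py in the first entry turns the package into a regular package, so portions in later entries are no longer found.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def can_import_subpackage(with_init: bool) -> bool:
    """Return True if demo_ns.second is importable across two path entries."""
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "first_site"
        second = Path(tmp) / "second_site"
        # Each path entry contributes one subpackage of demo_ns.
        (first / "demo_ns" / "first").mkdir(parents=True)
        (first / "demo_ns" / "first" / "__init__.py").write_text("")
        (second / "demo_ns" / "second").mkdir(parents=True)
        (second / "demo_ns" / "second" / "__init__.py").write_text("")
        if with_init:
            # With this file, demo_ns is a regular package and the import
            # system stops searching further path entries for portions.
            (first / "demo_ns" / "__init__.py").write_text("")
        code = (
            "import sys; "
            f"sys.path[:0] = [{str(first)!r}, {str(second)!r}]; "
            "import demo_ns.second"
        )
        # Run in a fresh interpreter so import caches do not interfere.
        result = subprocess.run([sys.executable, "-c", code], capture_output=True)
        return result.returncode == 0
```

Without the __init__.py the import succeeds because demo_ns is an implicit namespace package spanning both entries; with it, the import of demo_ns.second fails.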

…cooperative

copy-pr-bot · 2024-12-19T19:24:55Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

python/cuda_cccl/setup.py

leofang · 2024-12-19T19:47:54Z

pip install --editable does not work (anymore).

Q: In what way is it not working?

rwgk · 2024-12-19T19:54:14Z

pip install --editable does not work (anymore).

Q: In what way is it not working?

It is getting a non-existing path here:

cccl/python/cuda_parallel/cuda/parallel/experimental/_bindings.py

Line 57 in d6253b5

cub_include_path = str(f.parent / "_include")

At HEAD, cuda_parallel/cuda/_include exists in the source directory (it is .gitignored), but with this PR, that directory no longer exists and the trick/hack I'm using in the code above fails.

rwgk · 2024-12-19T19:58:31Z

On August 30, 2024 @leofang wrote:

#2281 (comment)

Leo: Do you still recommend that we replace shutil.copytree() (in setup.py) with the cmake-based alternative?

I'm asking because that'll take this PR in a very different direction (I think).

rwgk · 2024-12-19T23:21:37Z

Logging an observation (JIC it's useful to reference this later):

With CCCL HEAD (I have @ d6253b5)

TL;DR: pip install for cuda_cooperative or cuda_parallel does not work correctly in the first pass.

@gevtushenko could this explain your "only works 50% of the time" experience?

Current working directory is cccl/python/

deactivate
rm -rf devenv
git clean -fdx cuda_cooperative/ >& /dev/null
python -m venv devenv
. devenv/bin/activate
pip install --verbose ./cuda_cooperative[test] >& ~/pip_install_cuda_cooperative_log1.txt
ls -lR devenv/lib/python3.12/site-packages/cuda/_include | wc -l
pip install --verbose ./cuda_cooperative[test] >& ~/pip_install_cuda_cooperative_log2.txt
ls -lR devenv/lib/python3.12/site-packages/cuda/_include | wc -l

The output is:

ls: cannot access 'devenv/lib/python3.12/site-packages/cuda/_include': No such file or directory
0
2076

Similarly for cuda_parallel:

deactivate
rm -rf devenv
git clean -fdx cuda_parallel/ >& /dev/null
python -m venv devenv
. devenv/bin/activate
pip install --verbose ./cuda_parallel[test] >& ~/pip_install_cuda_parallel_log1.txt
ls -lR devenv/lib/python3.12/site-packages/cuda/_include | wc -l
pip install --verbose ./cuda_parallel[test] >& ~/pip_install_cuda_parallel_log2.txt
ls -lR devenv/lib/python3.12/site-packages/cuda/_include | wc -l

Same output as above.
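
The same check can be written in Python, which avoids hard-coding the python3.12 directory name. This is only a sketch: the function name count_installed_headers is hypothetical, and it counts regular files, so the number will not match the ls -lR | wc -l line count exactly.

```python
import sysconfig
from pathlib import Path


def count_installed_headers(site_packages: Path) -> int:
    """Count regular files under <site_packages>/cuda/_include (0 if missing)."""
    include_dir = site_packages / "cuda" / "_include"
    if not include_dir.is_dir():
        return 0
    return sum(1 for entry in include_dir.rglob("*") if entry.is_file())


if __name__ == "__main__":
    # "purelib" is the site-packages directory of the active environment.
    print(count_installed_headers(Path(sysconfig.get_paths()["purelib"])))
```

A result of 0 after the first pip install and a non-zero result after the second would correspond to the behavior logged above.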

rwgk · 2024-12-19T23:32:37Z

Now with this PR (@ daab580)

TL;DR: Same problem (this had me really confused TBH).

deactivate
rm -rf devenv
git clean -fdx cuda_cccl/ >& /dev/null
python -m venv devenv
. devenv/bin/activate
pip install --verbose ./cuda_cccl[test] >& ~/pip_install_cuda_cccl_log1.txt
ls -lR devenv/lib/python3.12/site-packages/cuda/_include | wc -l
pip install --verbose ./cuda_cccl[test] >& ~/pip_install_cuda_cccl_log2.txt
ls -lR devenv/lib/python3.12/site-packages/cuda/_include | wc -l

Output:

ls: cannot access 'devenv/lib/python3.12/site-packages/cuda/_include': No such file or directory
0
2076

rwgk · 2024-12-19T23:37:34Z

Small summary:

pip install cuda_cooperative or cuda_parallel does not work reliably at HEAD.
Simply moving out cuda-cccl does not fix that.
But we have the bigger question: Should I even try to fix that? Or should I switch from setuptools to CMake?

rwgk · 2024-12-20T00:20:07Z

Commit ef9d5f4 makes the pip install ./cuda_cccl command work reliably.

I wouldn't be surprised if this isn't the right way of doing it, but it does work in one pass.

… cuda._include to find the include path.

rwgk · 2024-12-20T00:57:35Z

Commit bc116dc fixes the pip install --editable issue.

… (they are equivalent to the default functions)

rwgk · 2024-12-20T01:55:53Z

It turns out what I discovered the hard way was actually a known issue:

cccl/ci/test_python.sh

Lines 23 to 27 in d6253b5

    
# Temporarily install the package twice to populate include directory as part of the first installation
# and to let manifest discover these includes during the second installation. Do not forget to remove the
# second installation after https://github.com/NVIDIA/cccl/issues/2281 is addressed.
run_command "⚙️  Pip install cuda_parallel once" pip install --force-reinstall --upgrade --target "${prefix}" .[test]
run_command "⚙️  Pip install cuda_parallel twice" pip install --force-reinstall --upgrade --target "${prefix}" .[test]

rwgk · 2024-12-20T01:57:51Z

/ok to test

github-actions · 2024-12-20T02:58:04Z

🟩 CI finished in 58m 34s: Pass: 100%/176 | Total: 1d 00h | Avg: 8m 22s | Max: 44m 12s | Hits: 99%/22510

🟩 libcudacxx: Pass: 100%/48 | Total: 7h 29m | Avg: 9m 22s | Max: 36m 14s | Hits: 98%/9814

🟩 cpu
  🟩 amd64              Pass: 100%/46  | Total:  7h 23m | Avg:  9m 37s | Max: 36m 14s | Hits:  98%/9814  
  🟩 arm64              Pass: 100%/2   | Total:  6m 50s | Avg:  3m 25s | Max:  3m 31s
🟩 ctk
  🟩 11.1               Pass: 100%/7   | Total: 54m 23s | Avg:  7m 46s | Max: 19m 13s | Hits:  98%/2239  
  🟩 12.5               Pass: 100%/2   | Total: 17m 29s | Avg:  8m 44s | Max:  8m 49s
  🟩 12.6               Pass: 100%/39  | Total:  6h 18m | Avg:  9m 41s | Max: 36m 14s | Hits:  98%/7575  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/4   | Total:  1h 06m | Avg: 16m 44s | Max: 20m 25s
  🟩 nvcc11.1           Pass: 100%/7   | Total: 54m 23s | Avg:  7m 46s | Max: 19m 13s | Hits:  98%/2239  
  🟩 nvcc12.5           Pass: 100%/2   | Total: 17m 29s | Avg:  8m 44s | Max:  8m 49s
  🟩 nvcc12.6           Pass: 100%/35  | Total:  5h 11m | Avg:  8m 53s | Max: 36m 14s | Hits:  98%/7575  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/4   | Total:  1h 06m | Avg: 16m 44s | Max: 20m 25s
  🟩 nvcc               Pass: 100%/44  | Total:  6h 22m | Avg:  8m 42s | Max: 36m 14s | Hits:  98%/9814  
🟩 cxx
  🟩 Clang9             Pass: 100%/4   | Total: 15m 17s | Avg:  3m 49s | Max:  4m 28s
  🟩 Clang10            Pass: 100%/1   | Total:  4m 45s | Avg:  4m 45s | Max:  4m 45s
  🟩 Clang11            Pass: 100%/1   | Total:  4m 10s | Avg:  4m 10s | Max:  4m 10s
  🟩 Clang12            Pass: 100%/1   | Total:  4m 14s | Avg:  4m 14s | Max:  4m 14s
  🟩 Clang13            Pass: 100%/1   | Total:  4m 24s | Avg:  4m 24s | Max:  4m 24s
  🟩 Clang14            Pass: 100%/1   | Total:  4m 02s | Avg:  4m 02s | Max:  4m 02s
  🟩 Clang15            Pass: 100%/1   | Total:  4m 38s | Avg:  4m 38s | Max:  4m 38s
  🟩 Clang16            Pass: 100%/1   | Total:  4m 33s | Avg:  4m 33s | Max:  4m 33s
  🟩 Clang17            Pass: 100%/1   | Total:  4m 14s | Avg:  4m 14s | Max:  4m 14s
  🟩 Clang18            Pass: 100%/8   | Total:  1h 49m | Avg: 13m 43s | Max: 30m 21s
  🟩 GCC6               Pass: 100%/2   | Total: 10m 24s | Avg:  5m 12s | Max:  7m 46s
  🟩 GCC7               Pass: 100%/2   | Total:  6m 32s | Avg:  3m 16s | Max:  3m 28s
  🟩 GCC8               Pass: 100%/1   | Total:  3m 48s | Avg:  3m 48s | Max:  3m 48s
  🟩 GCC9               Pass: 100%/3   | Total: 21m 47s | Avg:  7m 15s | Max: 15m 38s
  🟩 GCC10              Pass: 100%/1   | Total:  3m 47s | Avg:  3m 47s | Max:  3m 47s
  🟩 GCC11              Pass: 100%/1   | Total:  3m 42s | Avg:  3m 42s | Max:  3m 42s
  🟩 GCC12              Pass: 100%/1   | Total:  3m 49s | Avg:  3m 49s | Max:  3m 49s
  🟩 GCC13              Pass: 100%/10  | Total:  2h 32m | Avg: 15m 15s | Max: 36m 14s
  🟩 Intel2023.2.0      Pass: 100%/1   | Total:  5m 37s | Avg:  5m 37s | Max:  5m 37s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 19m 13s | Avg: 19m 13s | Max: 19m 13s | Hits:  98%/2239  
  🟩 MSVC14.29          Pass: 100%/1   | Total: 13m 12s | Avg: 13m 12s | Max: 13m 12s | Hits:  99%/2476  
  🟩 MSVC14.39          Pass: 100%/2   | Total: 27m 54s | Avg: 13m 57s | Max: 15m 08s | Hits:  98%/5099  
  🟩 NVHPC24.7          Pass: 100%/2   | Total: 17m 29s | Avg:  8m 44s | Max:  8m 49s
🟩 cxx_family
  🟩 Clang              Pass: 100%/20  | Total:  2h 40m | Avg:  8m 00s | Max: 30m 21s
  🟩 GCC                Pass: 100%/21  | Total:  3h 26m | Avg:  9m 49s | Max: 36m 14s
  🟩 Intel              Pass: 100%/1   | Total:  5m 37s | Avg:  5m 37s | Max:  5m 37s
  🟩 MSVC               Pass: 100%/4   | Total:  1h 00m | Avg: 15m 04s | Max: 19m 13s | Hits:  98%/9814  
  🟩 NVHPC              Pass: 100%/2   | Total: 17m 29s | Avg:  8m 44s | Max:  8m 49s
🟩 gpu
  🟩 v100               Pass: 100%/48  | Total:  7h 29m | Avg:  9m 22s | Max: 36m 14s | Hits:  98%/9814  
🟩 jobs
  🟩 Build              Pass: 100%/41  | Total:  4h 41m | Avg:  6m 52s | Max: 20m 25s | Hits:  98%/9814  
  🟩 NVRTC              Pass: 100%/4   | Total:  1h 52m | Avg: 28m 11s | Max: 36m 14s
  🟩 Test               Pass: 100%/2   | Total: 53m 12s | Avg: 26m 36s | Max: 30m 21s
  🟩 VerifyCodegen      Pass: 100%/1   | Total:  2m 04s | Avg:  2m 04s | Max:  2m 04s
🟩 sm
  🟩 90                 Pass: 100%/1   | Total: 14m 37s | Avg: 14m 37s | Max: 14m 37s
  🟩 90a                Pass: 100%/2   | Total: 16m 02s | Avg:  8m 01s | Max: 12m 11s
🟩 std
  🟩 11                 Pass: 100%/6   | Total: 46m 47s | Avg:  7m 47s | Max: 31m 13s
  🟩 14                 Pass: 100%/5   | Total: 59m 13s | Avg: 11m 50s | Max: 24m 18s | Hits:  98%/2239  
  🟩 17                 Pass: 100%/13  | Total:  2h 00m | Avg:  9m 18s | Max: 21m 02s | Hits:  99%/4952  
  🟩 20                 Pass: 100%/23  | Total:  3h 40m | Avg:  9m 36s | Max: 36m 14s | Hits:  98%/2623

🟩 cub: Pass: 100%/47 | Total: 7h 03m | Avg: 9m 00s | Max: 31m 24s | Hits: 99%/3124

🟩 cpu
  🟩 amd64              Pass: 100%/45  | Total:  6h 53m | Avg:  9m 11s | Max: 31m 24s | Hits:  99%/3124  
  🟩 arm64              Pass: 100%/2   | Total:  9m 47s | Avg:  4m 53s | Max:  4m 59s
🟩 ctk
  🟩 11.1               Pass: 100%/7   | Total: 41m 41s | Avg:  5m 57s | Max: 15m 29s | Hits:  99%/781   
  🟩 12.5               Pass: 100%/2   | Total: 18m 36s | Avg:  9m 18s | Max:  9m 23s
  🟩 12.6               Pass: 100%/38  | Total:  6h 02m | Avg:  9m 32s | Max: 31m 24s | Hits:  99%/2343  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total:  8m 55s | Avg:  4m 27s | Max:  4m 36s
  🟩 nvcc11.1           Pass: 100%/7   | Total: 41m 41s | Avg:  5m 57s | Max: 15m 29s | Hits:  99%/781   
  🟩 nvcc12.5           Pass: 100%/2   | Total: 18m 36s | Avg:  9m 18s | Max:  9m 23s
  🟩 nvcc12.6           Pass: 100%/36  | Total:  5h 53m | Avg:  9m 49s | Max: 31m 24s | Hits:  99%/2343  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total:  8m 55s | Avg:  4m 27s | Max:  4m 36s
  🟩 nvcc               Pass: 100%/45  | Total:  6h 54m | Avg:  9m 12s | Max: 31m 24s | Hits:  99%/3124  
🟩 cxx
  🟩 Clang9             Pass: 100%/4   | Total: 22m 04s | Avg:  5m 31s | Max:  6m 25s
  🟩 Clang10            Pass: 100%/1   | Total:  7m 01s | Avg:  7m 01s | Max:  7m 01s
  🟩 Clang11            Pass: 100%/1   | Total:  5m 42s | Avg:  5m 42s | Max:  5m 42s
  🟩 Clang12            Pass: 100%/1   | Total:  5m 38s | Avg:  5m 38s | Max:  5m 38s
  🟩 Clang13            Pass: 100%/1   | Total:  5m 32s | Avg:  5m 32s | Max:  5m 32s
  🟩 Clang14            Pass: 100%/1   | Total:  5m 27s | Avg:  5m 27s | Max:  5m 27s
  🟩 Clang15            Pass: 100%/1   | Total:  5m 51s | Avg:  5m 51s | Max:  5m 51s
  🟩 Clang16            Pass: 100%/1   | Total:  5m 43s | Avg:  5m 43s | Max:  5m 43s
  🟩 Clang17            Pass: 100%/1   | Total:  5m 34s | Avg:  5m 34s | Max:  5m 34s
  🟩 Clang18            Pass: 100%/7   | Total:  1h 18m | Avg: 11m 10s | Max: 31m 24s
  🟩 GCC6               Pass: 100%/2   | Total:  8m 23s | Avg:  4m 11s | Max:  4m 24s
  🟩 GCC7               Pass: 100%/2   | Total: 10m 24s | Avg:  5m 12s | Max:  5m 13s
  🟩 GCC8               Pass: 100%/1   | Total:  5m 30s | Avg:  5m 30s | Max:  5m 30s
  🟩 GCC9               Pass: 100%/3   | Total: 14m 11s | Avg:  4m 43s | Max:  5m 37s
  🟩 GCC10              Pass: 100%/1   | Total:  5m 48s | Avg:  5m 48s | Max:  5m 48s
  🟩 GCC11              Pass: 100%/1   | Total:  5m 40s | Avg:  5m 40s | Max:  5m 40s
  🟩 GCC12              Pass: 100%/3   | Total: 25m 38s | Avg:  8m 32s | Max: 15m 50s
  🟩 GCC13              Pass: 100%/8   | Total:  2h 01m | Avg: 15m 09s | Max: 31m 13s
  🟩 Intel2023.2.0      Pass: 100%/1   | Total:  6m 49s | Avg:  6m 49s | Max:  6m 49s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 15m 29s | Avg: 15m 29s | Max: 15m 29s | Hits:  99%/781   
  🟩 MSVC14.29          Pass: 100%/1   | Total: 12m 05s | Avg: 12m 05s | Max: 12m 05s | Hits:  99%/781   
  🟩 MSVC14.39          Pass: 100%/2   | Total: 26m 30s | Avg: 13m 15s | Max: 13m 43s | Hits:  99%/1562  
  🟩 NVHPC24.7          Pass: 100%/2   | Total: 18m 36s | Avg:  9m 18s | Max:  9m 23s
🟩 cxx_family
  🟩 Clang              Pass: 100%/19  | Total:  2h 26m | Avg:  7m 43s | Max: 31m 24s
  🟩 GCC                Pass: 100%/21  | Total:  3h 16m | Avg:  9m 22s | Max: 31m 13s
  🟩 Intel              Pass: 100%/1   | Total:  6m 49s | Avg:  6m 49s | Max:  6m 49s
  🟩 MSVC               Pass: 100%/4   | Total: 54m 04s | Avg: 13m 31s | Max: 15m 29s | Hits:  99%/3124  
  🟩 NVHPC              Pass: 100%/2   | Total: 18m 36s | Avg:  9m 18s | Max:  9m 23s
🟩 gpu
  🟩 h100               Pass: 100%/2   | Total: 20m 01s | Avg: 10m 00s | Max: 15m 50s
  🟩 v100               Pass: 100%/45  | Total:  6h 43m | Avg:  8m 57s | Max: 31m 24s | Hits:  99%/3124  
🟩 jobs
  🟩 Build              Pass: 100%/40  | Total:  4h 13m | Avg:  6m 20s | Max: 15m 29s | Hits:  99%/3124  
  🟩 DeviceLaunch       Pass: 100%/1   | Total: 24m 41s | Avg: 24m 41s | Max: 24m 41s
  🟩 GraphCapture       Pass: 100%/1   | Total: 27m 30s | Avg: 27m 30s | Max: 27m 30s
  🟩 HostLaunch         Pass: 100%/3   | Total:  1h 03m | Avg: 21m 13s | Max: 31m 24s
  🟩 TestGPU            Pass: 100%/2   | Total: 53m 48s | Avg: 26m 54s | Max: 31m 13s
🟩 sm
  🟩 90                 Pass: 100%/2   | Total: 20m 01s | Avg: 10m 00s | Max: 15m 50s
  🟩 90a                Pass: 100%/1   | Total:  4m 30s | Avg:  4m 30s | Max:  4m 30s
🟩 std
  🟩 11                 Pass: 100%/5   | Total: 24m 23s | Avg:  4m 52s | Max:  6m 25s
  🟩 14                 Pass: 100%/4   | Total: 31m 28s | Avg:  7m 52s | Max: 15m 29s | Hits:  99%/781   
  🟩 17                 Pass: 100%/12  | Total:  1h 24m | Avg:  7m 03s | Max: 13m 43s | Hits:  99%/1562  
  🟩 20                 Pass: 100%/26  | Total:  4h 42m | Avg: 10m 51s | Max: 31m 24s | Hits:  99%/781

🟩 thrust: Pass: 100%/46 | Total: 6h 20m | Avg: 8m 16s | Max: 33m 45s | Hits: 99%/9260

🟩 cmake_options
  🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 18m 32s | Avg:  9m 16s | Max: 12m 27s
🟩 cpu
  🟩 amd64              Pass: 100%/44  | Total:  6h 11m | Avg:  8m 26s | Max: 33m 45s | Hits:  99%/9260  
  🟩 arm64              Pass: 100%/2   | Total:  9m 38s | Avg:  4m 49s | Max:  5m 08s
🟩 ctk
  🟩 11.1               Pass: 100%/7   | Total: 44m 55s | Avg:  6m 25s | Max: 18m 45s | Hits:  99%/1852  
  🟩 12.5               Pass: 100%/2   | Total: 27m 55s | Avg: 13m 57s | Max: 14m 05s
  🟩 12.6               Pass: 100%/37  | Total:  5h 07m | Avg:  8m 19s | Max: 33m 45s | Hits:  99%/7408  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total: 10m 08s | Avg:  5m 04s | Max:  5m 07s
  🟩 nvcc11.1           Pass: 100%/7   | Total: 44m 55s | Avg:  6m 25s | Max: 18m 45s | Hits:  99%/1852  
  🟩 nvcc12.5           Pass: 100%/2   | Total: 27m 55s | Avg: 13m 57s | Max: 14m 05s
  🟩 nvcc12.6           Pass: 100%/35  | Total:  4h 57m | Avg:  8m 30s | Max: 33m 45s | Hits:  99%/7408  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 08s | Avg:  5m 04s | Max:  5m 07s
  🟩 nvcc               Pass: 100%/44  | Total:  6h 10m | Avg:  8m 25s | Max: 33m 45s | Hits:  99%/9260  
🟩 cxx
  🟩 Clang9             Pass: 100%/4   | Total: 21m 19s | Avg:  5m 19s | Max:  6m 25s
  🟩 Clang10            Pass: 100%/1   | Total:  6m 42s | Avg:  6m 42s | Max:  6m 42s
  🟩 Clang11            Pass: 100%/1   | Total:  5m 15s | Avg:  5m 15s | Max:  5m 15s
  🟩 Clang12            Pass: 100%/1   | Total:  5m 23s | Avg:  5m 23s | Max:  5m 23s
  🟩 Clang13            Pass: 100%/1   | Total:  5m 45s | Avg:  5m 45s | Max:  5m 45s
  🟩 Clang14            Pass: 100%/1   | Total:  5m 10s | Avg:  5m 10s | Max:  5m 10s
  🟩 Clang15            Pass: 100%/1   | Total:  5m 46s | Avg:  5m 46s | Max:  5m 46s
  🟩 Clang16            Pass: 100%/1   | Total:  5m 20s | Avg:  5m 20s | Max:  5m 20s
  🟩 Clang17            Pass: 100%/1   | Total:  5m 32s | Avg:  5m 32s | Max:  5m 32s
  🟩 Clang18            Pass: 100%/7   | Total:  1h 07m | Avg:  9m 36s | Max: 33m 45s
  🟩 GCC6               Pass: 100%/2   | Total:  8m 15s | Avg:  4m 07s | Max:  4m 09s
  🟩 GCC7               Pass: 100%/2   | Total:  9m 25s | Avg:  4m 42s | Max:  5m 01s
  🟩 GCC8               Pass: 100%/1   | Total:  5m 00s | Avg:  5m 00s | Max:  5m 00s
  🟩 GCC9               Pass: 100%/3   | Total: 14m 45s | Avg:  4m 55s | Max:  5m 54s
  🟩 GCC10              Pass: 100%/1   | Total:  5m 42s | Avg:  5m 42s | Max:  5m 42s
  🟩 GCC11              Pass: 100%/1   | Total:  5m 40s | Avg:  5m 40s | Max:  5m 40s
  🟩 GCC12              Pass: 100%/1   | Total:  6m 17s | Avg:  6m 17s | Max:  6m 17s
  🟩 GCC13              Pass: 100%/8   | Total:  1h 05m | Avg:  8m 09s | Max: 17m 03s
  🟩 Intel2023.2.0      Pass: 100%/1   | Total:  6m 58s | Avg:  6m 58s | Max:  6m 58s
  🟩 MSVC14.16          Pass: 100%/1   | Total: 18m 45s | Avg: 18m 45s | Max: 18m 45s | Hits:  99%/1852  
  🟩 MSVC14.29          Pass: 100%/1   | Total: 15m 05s | Avg: 15m 05s | Max: 15m 05s | Hits:  99%/1852  
  🟩 MSVC14.39          Pass: 100%/3   | Total: 58m 19s | Avg: 19m 26s | Max: 23m 53s | Hits:  99%/5556  
  🟩 NVHPC24.7          Pass: 100%/2   | Total: 27m 55s | Avg: 13m 57s | Max: 14m 05s
🟩 cxx_family
  🟩 Clang              Pass: 100%/19  | Total:  2h 13m | Avg:  7m 01s | Max: 33m 45s
  🟩 GCC                Pass: 100%/19  | Total:  2h 00m | Avg:  6m 19s | Max: 17m 03s
  🟩 Intel              Pass: 100%/1   | Total:  6m 58s | Avg:  6m 58s | Max:  6m 58s
  🟩 MSVC               Pass: 100%/5   | Total:  1h 32m | Avg: 18m 25s | Max: 23m 53s | Hits:  99%/9260  
  🟩 NVHPC              Pass: 100%/2   | Total: 27m 55s | Avg: 13m 57s | Max: 14m 05s
🟩 gpu
  🟩 v100               Pass: 100%/46  | Total:  6h 20m | Avg:  8m 16s | Max: 33m 45s | Hits:  99%/9260  
🟩 jobs
  🟩 Build              Pass: 100%/40  | Total:  4h 37m | Avg:  6m 56s | Max: 18m 45s | Hits:  99%/7408  
  🟩 TestCPU            Pass: 100%/3   | Total: 39m 40s | Avg: 13m 13s | Max: 23m 53s | Hits:  99%/1852  
  🟩 TestGPU            Pass: 100%/3   | Total:  1h 03m | Avg: 21m 05s | Max: 33m 45s
🟩 sm
  🟩 90a                Pass: 100%/1   | Total:  4m 28s | Avg:  4m 28s | Max:  4m 28s
🟩 std
  🟩 11                 Pass: 100%/5   | Total: 22m 32s | Avg:  4m 30s | Max:  5m 50s
  🟩 14                 Pass: 100%/4   | Total: 34m 17s | Avg:  8m 34s | Max: 18m 45s | Hits:  99%/1852  
  🟩 17                 Pass: 100%/12  | Total:  1h 36m | Avg:  8m 03s | Max: 17m 03s | Hits:  99%/3704  
  🟩 20                 Pass: 100%/23  | Total:  3h 28m | Avg:  9m 04s | Max: 33m 45s | Hits:  99%/3704

🟩 cudax: Pass: 100%/26 | Total: 2h 18m | Avg: 5m 19s | Max: 31m 25s | Hits: 92%/312

🟩 cpu
  🟩 amd64              Pass: 100%/22  | Total:  2h 07m | Avg:  5m 48s | Max: 31m 25s | Hits:  92%/312   
  🟩 arm64              Pass: 100%/4   | Total: 10m 21s | Avg:  2m 35s | Max:  2m 39s
🟩 ctk
  🟩 12.0               Pass: 100%/3   | Total: 14m 27s | Avg:  4m 49s | Max:  8m 33s | Hits:  92%/156   
  🟩 12.5               Pass: 100%/2   | Total: 10m 57s | Avg:  5m 28s | Max:  5m 40s
  🟩 12.6               Pass: 100%/21  | Total:  1h 52m | Avg:  5m 22s | Max: 31m 25s | Hits:  92%/156   
🟩 cudacxx
  🟩 nvcc12.0           Pass: 100%/3   | Total: 14m 27s | Avg:  4m 49s | Max:  8m 33s | Hits:  92%/156   
  🟩 nvcc12.5           Pass: 100%/2   | Total: 10m 57s | Avg:  5m 28s | Max:  5m 40s
  🟩 nvcc12.6           Pass: 100%/21  | Total:  1h 52m | Avg:  5m 22s | Max: 31m 25s | Hits:  92%/156   
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/26  | Total:  2h 18m | Avg:  5m 19s | Max: 31m 25s | Hits:  92%/312   
🟩 cxx
  🟩 Clang9             Pass: 100%/1   | Total:  3m 03s | Avg:  3m 03s | Max:  3m 03s
  🟩 Clang10            Pass: 100%/1   | Total:  4m 03s | Avg:  4m 03s | Max:  4m 03s
  🟩 Clang11            Pass: 100%/1   | Total:  3m 13s | Avg:  3m 13s | Max:  3m 13s
  🟩 Clang12            Pass: 100%/1   | Total:  3m 07s | Avg:  3m 07s | Max:  3m 07s
  🟩 Clang13            Pass: 100%/1   | Total:  3m 17s | Avg:  3m 17s | Max:  3m 17s
  🟩 Clang14            Pass: 100%/1   | Total:  3m 33s | Avg:  3m 33s | Max:  3m 33s
  🟩 Clang15            Pass: 100%/1   | Total:  3m 14s | Avg:  3m 14s | Max:  3m 14s
  🟩 Clang16            Pass: 100%/1   | Total:  3m 12s | Avg:  3m 12s | Max:  3m 12s
  🟩 Clang17            Pass: 100%/1   | Total:  3m 16s | Avg:  3m 16s | Max:  3m 16s
  🟩 Clang18            Pass: 100%/4   | Total: 39m 45s | Avg:  9m 56s | Max: 31m 25s
  🟩 GCC9               Pass: 100%/1   | Total:  2m 51s | Avg:  2m 51s | Max:  2m 51s
  🟩 GCC10              Pass: 100%/1   | Total:  3m 23s | Avg:  3m 23s | Max:  3m 23s
  🟩 GCC11              Pass: 100%/1   | Total:  3m 03s | Avg:  3m 03s | Max:  3m 03s
  🟩 GCC12              Pass: 100%/2   | Total: 20m 10s | Avg: 10m 05s | Max: 16m 46s
  🟩 GCC13              Pass: 100%/4   | Total: 10m 56s | Avg:  2m 44s | Max:  2m 58s
  🟩 MSVC14.36          Pass: 100%/1   | Total:  8m 33s | Avg:  8m 33s | Max:  8m 33s | Hits:  92%/156   
  🟩 MSVC14.39          Pass: 100%/1   | Total:  8m 39s | Avg:  8m 39s | Max:  8m 39s | Hits:  92%/156   
  🟩 NVHPC24.7          Pass: 100%/2   | Total: 10m 57s | Avg:  5m 28s | Max:  5m 40s
🟩 cxx_family
  🟩 Clang              Pass: 100%/13  | Total:  1h 09m | Avg:  5m 21s | Max: 31m 25s
  🟩 GCC                Pass: 100%/9   | Total: 40m 23s | Avg:  4m 29s | Max: 16m 46s
  🟩 MSVC               Pass: 100%/2   | Total: 17m 12s | Avg:  8m 36s | Max:  8m 39s | Hits:  92%/312   
  🟩 NVHPC              Pass: 100%/2   | Total: 10m 57s | Avg:  5m 28s | Max:  5m 40s
🟩 gpu
  🟩 v100               Pass: 100%/26  | Total:  2h 18m | Avg:  5m 19s | Max: 31m 25s | Hits:  92%/312   
🟩 jobs
  🟩 Build              Pass: 100%/24  | Total:  1h 30m | Avg:  3m 45s | Max:  8m 39s | Hits:  92%/312   
  🟩 Test               Pass: 100%/2   | Total: 48m 11s | Avg: 24m 05s | Max: 31m 25s
🟩 sm
  🟩 90                 Pass: 100%/1   | Total:  2m 58s | Avg:  2m 58s | Max:  2m 58s
  🟩 90a                Pass: 100%/1   | Total:  2m 45s | Avg:  2m 45s | Max:  2m 45s
🟩 std
  🟩 17                 Pass: 100%/6   | Total: 19m 42s | Avg:  3m 17s | Max:  5m 40s
  🟩 20                 Pass: 100%/20  | Total:  1h 58m | Avg:  5m 55s | Max: 31m 25s | Hits:  92%/312

🟩 cccl: Pass: 100%/6 | Total: 27m 15s | Avg: 4m 32s | Max: 4m 51s

🟩 cpu
  🟩 amd64              Pass: 100%/6   | Total: 27m 15s | Avg:  4m 32s | Max:  4m 51s
🟩 ctk
  🟩 11.1               Pass: 100%/2   | Total:  8m 47s | Avg:  4m 23s | Max:  4m 51s
  🟩 12.0               Pass: 100%/2   | Total:  9m 26s | Avg:  4m 43s | Max:  4m 45s
  🟩 12.6               Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  4m 49s
🟩 cudacxx
  🟩 nvcc11.1           Pass: 100%/2   | Total:  8m 47s | Avg:  4m 23s | Max:  4m 51s
  🟩 nvcc12.0           Pass: 100%/2   | Total:  9m 26s | Avg:  4m 43s | Max:  4m 45s
  🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 02s | Avg:  4m 31s | Max:  4m 49s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/6   | Total: 27m 15s | Avg:  4m 32s | Max:  4m 51s
🟩 cxx
  🟩 Clang9             Pass: 100%/1   | Total:  4m 51s | Avg:  4m 51s | Max:  4m 51s
  🟩 Clang14            Pass: 100%/1   | Total:  4m 41s | Avg:  4m 41s | Max:  4m 41s
  🟩 Clang18            Pass: 100%/1   | Total:  4m 49s | Avg:  4m 49s | Max:  4m 49s
  🟩 GCC6               Pass: 100%/1   | Total:  3m 56s | Avg:  3m 56s | Max:  3m 56s
  🟩 GCC12              Pass: 100%/1   | Total:  4m 45s | Avg:  4m 45s | Max:  4m 45s
  🟩 GCC13              Pass: 100%/1   | Total:  4m 13s | Avg:  4m 13s | Max:  4m 13s
🟩 cxx_family
  🟩 Clang              Pass: 100%/3   | Total: 14m 21s | Avg:  4m 47s | Max:  4m 51s
  🟩 GCC                Pass: 100%/3   | Total: 12m 54s | Avg:  4m 18s | Max:  4m 45s
🟩 gpu
  🟩 v100               Pass: 100%/6   | Total: 27m 15s | Avg:  4m 32s | Max:  4m 51s
🟩 jobs
  🟩 Infra              Pass: 100%/6   | Total: 27m 15s | Avg:  4m 32s | Max:  4m 51s

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 04s | Avg: 5m 02s | Max: 8m 04s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  8m 04s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  8m 04s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  8m 04s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  8m 04s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  8m 04s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  8m 04s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total: 10m 04s | Avg:  5m 02s | Max:  8m 04s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 00s | Avg:  2m 00s | Max:  2m 00s
  🟩 Test               Pass: 100%/1   | Total:  8m 04s | Avg:  8m 04s | Max:  8m 04s

🟩 python: Pass: 100%/1 | Total: 44m 12s | Avg: 44m 12s | Max: 44m 12s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 44m 12s | Avg: 44m 12s | Max: 44m 12s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 44m 12s | Avg: 44m 12s | Max: 44m 12s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 44m 12s | Avg: 44m 12s | Max: 44m 12s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 44m 12s | Avg: 44m 12s | Max: 44m 12s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 44m 12s | Avg: 44m 12s | Max: 44m 12s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 44m 12s | Avg: 44m 12s | Max: 44m 12s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 44m 12s | Avg: 44m 12s | Max: 44m 12s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 44m 12s | Avg: 44m 12s | Max: 44m 12s

👃 Inspect Changes

Modifications in project?

	Project
+/-	CCCL Infrastructure
	libcu++
	CUB
	Thrust
	CUDA Experimental
+/-	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
+/-	CCCL Infrastructure
+/-	libcu++
+/-	CUB
+/-	Thrust
+/-	CUDA Experimental
+/-	python
+/-	CCCL C Parallel Library
+/-	Catch2Helper

🏃‍ Runner counts (total jobs: 176)

#	Runner
125	`linux-amd64-cpu16`
25	`linux-amd64-gpu-v100-latest-1`
15	`windows-amd64-cpu16`
10	`linux-arm64-cpu16`
1	`linux-amd64-gpu-h100-latest-1-testing`

…docs][skip-vdc][skip pre-commit.ci]

…skip-docs][skip-vdc]

…kip-vdc][skip pre-commit.ci]" This reverts commit ec206fd.

python/cuda_cccl/cuda/cccl/__init__.py

leofang

We're in good shape now! A few minor comments.

leofang · 2025-01-16T18:54:11Z

python/cuda_cooperative/cuda/cooperative/experimental/_nvrtc.py

+    from cuda.cccl import get_include_paths
+
+    for path in get_include_paths().as_tuple():
+        if path:


I think this check should be moved to get_include_paths() so that we only pay this cost once per process? (and I think you've done that check via assert!)

I think it's better to keep the if here, because with the current setup ...

@dataclass
class IncludePaths:
    cuda: Optional[Path]
    libcudacxx: Optional[Path]
    cub: Optional[Path]
    thrust: Optional[Path]

... it's safer. The Optional annotations here are for flexibility/reusability/future-proofing.

Possibly, in the future some of the paths will be None.

I expect the runtime overhead (the price we pay for the flexibility) to be unmeasurable, especially because this function is cached, but even without caching.

However, I changed it to if path is not None: (commit 12dbf29), for consistency, after I realized that's what we have in python/cuda_parallel/cuda/parallel/experimental/_bindings.py.
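
To make the trade-off concrete, here is a minimal sketch of the pattern under discussion. It is not the code in this PR: the placeholder paths, the use of functools.lru_cache for the caching, and the helper include_options are assumptions for illustration. Only the IncludePaths fields, as_tuple(), and the "if path is not None" filter come from the thread.

```python
from dataclasses import dataclass
from functools import lru_cache
from pathlib import Path
from typing import List, Optional, Tuple


@dataclass
class IncludePaths:
    cuda: Optional[Path]
    libcudacxx: Optional[Path]
    cub: Optional[Path]
    thrust: Optional[Path]

    def as_tuple(self) -> Tuple[Optional[Path], ...]:
        # Fixed field order gives callers a stable include order.
        return (self.cuda, self.libcudacxx, self.cub, self.thrust)


@lru_cache(maxsize=None)
def get_include_paths() -> IncludePaths:
    # Placeholder values only; a real implementation would locate the
    # headers on disk. The root below is a hypothetical location.
    root = Path("/opt/example/include")
    return IncludePaths(
        cuda=None,  # simulate a path that could not be determined
        libcudacxx=root / "libcudacxx",
        cub=root / "cub",
        thrust=root / "thrust",
    )


def include_options() -> List[str]:
    # Hypothetical consumer: skip entries that are None.
    return [f"-I{path}" for path in get_include_paths().as_tuple() if path is not None]
```

Because the result is cached, the lookup runs once per process, while the per-caller None filter stays cheap and keeps consumers safe if a field is unset.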

python/cuda_cooperative/pyproject.toml

python/cuda_cooperative/setup.py

leofang · 2025-01-16T18:57:42Z

python/cuda_parallel/cuda/parallel/experimental/_bindings.py

-            libcudacxx_include_path,
-            cuda_include_path,
-        )
+        for path in get_include_paths().as_tuple()
        if path is not None


ditto, path could be checked only once in get_include_paths()

python/cuda_parallel/pyproject.toml

python/cuda_cooperative/pyproject.toml

leofang · 2025-01-16T19:00:57Z

python/cuda_cooperative/pyproject.toml

+]
+requires-python = ">=3.9"
+dependencies = [
+  "cuda-cccl",


I don't have a good way to declare a version constraint for cuda-cccl statically. I suspect we will need to move dependencies to setup.py's install_requires; let us do this in another PR.
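
Purely as a sketch of what that follow-up could look like: compute the requirement string at build time and pass it to install_requires. The helper name, the version argument, and the choice of a major.minor.* pin are all assumptions for illustration; the thread does not say which constraint is intended.

```python
def cuda_cccl_requirement(package_version: str) -> str:
    """Build a PEP 440 requirement pinning cuda-cccl to the same major.minor.

    Hypothetical helper; the pinning policy is an assumption.
    """
    major, minor = package_version.split(".")[:2]
    return f"cuda-cccl=={major}.{minor}.*"


# In a setup.py this would be used roughly as:
#     setup(..., install_requires=[cuda_cccl_requirement(version)])
```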

github-actions · 2025-01-16T22:26:24Z

🟩 CI finished in 2h 16m: Pass: 100%/148 | Total: 1d 14h | Avg: 15m 26s | Max: 1h 32m | Hits: 455%/25823

🟩 libcudacxx: Pass: 100%/46 | Total: 9h 41m | Avg: 12m 38s | Max: 37m 03s | Hits: 668%/12541

🟩 cpu
  🟩 amd64              Pass: 100%/44  | Total:  9h 20m | Avg: 12m 44s | Max: 37m 03s | Hits: 668%/12541 
  🟩 arm64              Pass: 100%/2   | Total: 20m 40s | Avg: 10m 20s | Max: 17m 10s
🟩 ctk
  🟩 12.0               Pass: 100%/8   | Total:  1h 17m | Avg:  9m 39s | Max: 21m 49s | Hits: 649%/4895  
  🟩 12.5               Pass: 100%/2   | Total:  1h 07m | Avg: 33m 42s | Max: 37m 03s
  🟩 12.6               Pass: 100%/36  | Total:  7h 16m | Avg: 12m 08s | Max: 28m 29s | Hits: 680%/7646  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/4   | Total:  1h 07m | Avg: 16m 50s | Max: 22m 20s
  🟩 nvcc12.0           Pass: 100%/8   | Total:  1h 17m | Avg:  9m 39s | Max: 21m 49s | Hits: 649%/4895  
  🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 07m | Avg: 33m 42s | Max: 37m 03s
  🟩 nvcc12.6           Pass: 100%/32  | Total:  6h 09m | Avg: 11m 32s | Max: 28m 29s | Hits: 680%/7646  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/4   | Total:  1h 07m | Avg: 16m 50s | Max: 22m 20s
  🟩 nvcc               Pass: 100%/42  | Total:  8h 34m | Avg: 12m 14s | Max: 37m 03s | Hits: 668%/12541 
🟩 cxx
  🟩 Clang14            Pass: 100%/6   | Total: 37m 29s | Avg:  6m 14s | Max: 16m 04s
  🟩 Clang15            Pass: 100%/1   | Total:  4m 31s | Avg:  4m 31s | Max:  4m 31s
  🟩 Clang16            Pass: 100%/1   | Total:  6m 08s | Avg:  6m 08s | Max:  6m 08s
  🟩 Clang17            Pass: 100%/1   | Total:  4m 45s | Avg:  4m 45s | Max:  4m 45s
  🟩 Clang18            Pass: 100%/8   | Total:  1h 52m | Avg: 14m 01s | Max: 22m 20s
  🟩 GCC7               Pass: 100%/5   | Total: 31m 44s | Avg:  6m 20s | Max: 17m 24s
  🟩 GCC8               Pass: 100%/1   | Total:  4m 00s | Avg:  4m 00s | Max:  4m 00s
  🟩 GCC9               Pass: 100%/3   | Total: 40m 34s | Avg: 13m 31s | Max: 20m 25s
  🟩 GCC10              Pass: 100%/1   | Total:  3m 54s | Avg:  3m 54s | Max:  3m 54s
  🟩 GCC11              Pass: 100%/1   | Total:  3m 45s | Avg:  3m 45s | Max:  3m 45s
  🟩 GCC12              Pass: 100%/1   | Total:  3m 58s | Avg:  3m 58s | Max:  3m 58s
  🟩 GCC13              Pass: 100%/10  | Total:  2h 21m | Avg: 14m 09s | Max: 28m 29s
  🟩 MSVC14.29          Pass: 100%/3   | Total:  1h 08m | Avg: 22m 41s | Max: 24m 57s | Hits: 659%/7393  
  🟩 MSVC14.39          Pass: 100%/2   | Total: 51m 32s | Avg: 25m 46s | Max: 26m 48s | Hits: 681%/5148  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 07m | Avg: 33m 42s | Max: 37m 03s
🟩 cxx_family
  🟩 Clang              Pass: 100%/17  | Total:  2h 45m | Avg:  9m 42s | Max: 22m 20s
  🟩 GCC                Pass: 100%/22  | Total:  3h 49m | Avg: 10m 25s | Max: 28m 29s
  🟩 MSVC               Pass: 100%/5   | Total:  1h 59m | Avg: 23m 55s | Max: 26m 48s | Hits: 668%/12541 
  🟩 NVHPC              Pass: 100%/2   | Total:  1h 07m | Avg: 33m 42s | Max: 37m 03s
🟩 gpu
  🟩 v100               Pass: 100%/46  | Total:  9h 41m | Avg: 12m 38s | Max: 37m 03s | Hits: 668%/12541 
🟩 jobs
  🟩 Build              Pass: 100%/39  | Total:  7h 16m | Avg: 11m 11s | Max: 37m 03s | Hits: 668%/12541 
  🟩 NVRTC              Pass: 100%/4   | Total:  1h 35m | Avg: 23m 57s | Max: 26m 41s
  🟩 Test               Pass: 100%/2   | Total: 47m 17s | Avg: 23m 38s | Max: 28m 29s
  🟩 VerifyCodegen      Pass: 100%/1   | Total:  2m 05s | Avg:  2m 05s | Max:  2m 05s
🟩 sm
  🟩 90                 Pass: 100%/1   | Total: 12m 26s | Avg: 12m 26s | Max: 12m 26s
  🟩 90a                Pass: 100%/2   | Total: 17m 48s | Avg:  8m 54s | Max: 13m 53s
🟩 std
  🟩 11                 Pass: 100%/6   | Total: 51m 44s | Avg:  8m 37s | Max: 21m 51s
  🟩 14                 Pass: 100%/4   | Total:  1h 11m | Avg: 17m 49s | Max: 26m 41s | Hits: 615%/2407  
  🟩 17                 Pass: 100%/14  | Total:  3h 30m | Avg: 15m 03s | Max: 37m 03s | Hits: 681%/7484  
  🟩 20                 Pass: 100%/21  | Total:  4h 05m | Avg: 11m 41s | Max: 30m 21s | Hits: 680%/2650

🟩 cub: Pass: 100%/38 | Total: 15h 18m | Avg: 24m 10s | Max: 1h 32m | Hits: 307%/3540

🟩 cpu
  🟩 amd64              Pass: 100%/36  | Total: 14h 59m | Avg: 24m 59s | Max:  1h 32m | Hits: 307%/3540  
  🟩 arm64              Pass: 100%/2   | Total: 18m 52s | Avg:  9m 26s | Max:  9m 46s
🟩 ctk
  🟩 12.0               Pass: 100%/5   | Total:  1h 35m | Avg: 19m 01s | Max: 59m 33s | Hits: 308%/885   
  🟩 12.5               Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 02m
  🟩 12.6               Pass: 100%/31  | Total: 11h 37m | Avg: 22m 30s | Max:  1h 32m | Hits: 307%/2655  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total:  9m 14s | Avg:  4m 37s | Max:  4m 41s
  🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 35m | Avg: 19m 01s | Max: 59m 33s | Hits: 308%/885   
  🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 02m
  🟩 nvcc12.6           Pass: 100%/29  | Total: 11h 28m | Avg: 23m 44s | Max:  1h 32m | Hits: 307%/2655  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total:  9m 14s | Avg:  4m 37s | Max:  4m 41s
  🟩 nvcc               Pass: 100%/36  | Total: 15h 09m | Avg: 25m 15s | Max:  1h 32m | Hits: 307%/3540  
🟩 cxx
  🟩 Clang14            Pass: 100%/4   | Total: 36m 21s | Avg:  9m 05s | Max:  9m 27s
  🟩 Clang15            Pass: 100%/1   | Total:  9m 28s | Avg:  9m 28s | Max:  9m 28s
  🟩 Clang16            Pass: 100%/1   | Total:  9m 07s | Avg:  9m 07s | Max:  9m 07s
  🟩 Clang17            Pass: 100%/1   | Total:  9m 33s | Avg:  9m 33s | Max:  9m 33s
  🟩 Clang18            Pass: 100%/7   | Total:  1h 37m | Avg: 13m 57s | Max: 36m 53s
  🟩 GCC7               Pass: 100%/2   | Total: 17m 23s | Avg:  8m 41s | Max:  9m 01s
  🟩 GCC8               Pass: 100%/1   | Total:  8m 55s | Avg:  8m 55s | Max:  8m 55s
  🟩 GCC9               Pass: 100%/2   | Total: 18m 38s | Avg:  9m 19s | Max:  9m 50s
  🟩 GCC10              Pass: 100%/1   | Total:  9m 20s | Avg:  9m 20s | Max:  9m 20s
  🟩 GCC11              Pass: 100%/1   | Total:  8m 50s | Avg:  8m 50s | Max:  8m 50s
  🟩 GCC12              Pass: 100%/3   | Total: 35m 19s | Avg: 11m 46s | Max: 19m 34s
  🟩 GCC13              Pass: 100%/8   | Total:  4h 22m | Avg: 32m 49s | Max:  1h 32m
  🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 09m | Avg:  1h 04m | Max:  1h 09m | Hits: 308%/1770  
  🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 13m | Hits: 307%/1770  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 02m
🟩 cxx_family
  🟩 Clang              Pass: 100%/14  | Total:  2h 42m | Avg: 11m 35s | Max: 36m 53s
  🟩 GCC                Pass: 100%/18  | Total:  6h 01m | Avg: 20m 03s | Max:  1h 32m
  🟩 MSVC               Pass: 100%/4   | Total:  4h 29m | Avg:  1h 07m | Max:  1h 13m | Hits: 307%/3540  
  🟩 NVHPC              Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 02m
🟩 gpu
  🟩 h100               Pass: 100%/2   | Total: 25m 33s | Avg: 12m 46s | Max: 19m 34s
  🟩 v100               Pass: 100%/36  | Total: 14h 52m | Avg: 24m 48s | Max:  1h 32m | Hits: 307%/3540  
🟩 jobs
  🟩 Build              Pass: 100%/31  | Total: 10h 57m | Avg: 21m 12s | Max:  1h 13m | Hits: 307%/3540  
  🟩 DeviceLaunch       Pass: 100%/1   | Total: 22m 25s | Avg: 22m 25s | Max: 22m 25s
  🟩 GraphCapture       Pass: 100%/1   | Total: 38m 50s | Avg: 38m 50s | Max: 38m 50s
  🟩 HostLaunch         Pass: 100%/3   | Total:  1h 10m | Avg: 23m 27s | Max: 26m 54s
  🟩 TestGPU            Pass: 100%/2   | Total:  2h 09m | Avg:  1h 04m | Max:  1h 32m
🟩 sm
  🟩 90                 Pass: 100%/2   | Total: 25m 33s | Avg: 12m 46s | Max: 19m 34s
  🟩 90a                Pass: 100%/1   | Total:  5m 52s | Avg:  5m 52s | Max:  5m 52s
🟩 std
  🟩 17                 Pass: 100%/14  | Total:  6h 38m | Avg: 28m 28s | Max:  1h 13m | Hits: 308%/2655  
  🟩 20                 Pass: 100%/24  | Total:  8h 39m | Avg: 21m 39s | Max:  1h 32m | Hits: 306%/885

🟩 thrust: Pass: 100%/37 | Total: 9h 48m | Avg: 15m 53s | Max: 53m 48s | Hits: 226%/9220

🟩 cmake_options
  🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 27m 31s | Avg: 13m 45s | Max: 21m 25s
🟩 cpu
  🟩 amd64              Pass: 100%/35  | Total:  9h 38m | Avg: 16m 31s | Max: 53m 48s | Hits: 226%/9220  
  🟩 arm64              Pass: 100%/2   | Total:  9m 53s | Avg:  4m 56s | Max:  5m 07s
🟩 ctk
  🟩 12.0               Pass: 100%/5   | Total:  1h 04m | Avg: 12m 59s | Max: 45m 42s | Hits: 191%/1844  
  🟩 12.5               Pass: 100%/2   | Total:  1h 46m | Avg: 53m 25s | Max: 53m 48s
  🟩 12.6               Pass: 100%/30  | Total:  6h 56m | Avg: 13m 52s | Max: 51m 58s | Hits: 234%/7376  
🟩 cudacxx
  🟩 ClangCUDA18        Pass: 100%/2   | Total:  9m 55s | Avg:  4m 57s | Max:  5m 02s
  🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 04m | Avg: 12m 59s | Max: 45m 42s | Hits: 191%/1844  
  🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 46m | Avg: 53m 25s | Max: 53m 48s
  🟩 nvcc12.6           Pass: 100%/28  | Total:  6h 46m | Avg: 14m 30s | Max: 51m 58s | Hits: 234%/7376  
🟩 cudacxx_family
  🟩 ClangCUDA          Pass: 100%/2   | Total:  9m 55s | Avg:  4m 57s | Max:  5m 02s
  🟩 nvcc               Pass: 100%/35  | Total:  9h 38m | Avg: 16m 31s | Max: 53m 48s | Hits: 226%/9220  
🟩 cxx
  🟩 Clang14            Pass: 100%/4   | Total: 20m 17s | Avg:  5m 04s | Max:  5m 26s
  🟩 Clang15            Pass: 100%/1   | Total:  5m 28s | Avg:  5m 28s | Max:  5m 28s
  🟩 Clang16            Pass: 100%/1   | Total:  5m 29s | Avg:  5m 29s | Max:  5m 29s
  🟩 Clang17            Pass: 100%/1   | Total:  5m 28s | Avg:  5m 28s | Max:  5m 28s
  🟩 Clang18            Pass: 100%/7   | Total:  1h 06m | Avg:  9m 33s | Max: 33m 33s
  🟩 GCC7               Pass: 100%/2   | Total: 10m 08s | Avg:  5m 04s | Max:  5m 31s
  🟩 GCC8               Pass: 100%/1   | Total:  5m 26s | Avg:  5m 26s | Max:  5m 26s
  🟩 GCC9               Pass: 100%/2   | Total: 41m 13s | Avg: 20m 36s | Max: 36m 02s
  🟩 GCC10              Pass: 100%/1   | Total:  6m 01s | Avg:  6m 01s | Max:  6m 01s
  🟩 GCC11              Pass: 100%/1   | Total:  5m 36s | Avg:  5m 36s | Max:  5m 36s
  🟩 GCC12              Pass: 100%/1   | Total:  5m 43s | Avg:  5m 43s | Max:  5m 43s
  🟩 GCC13              Pass: 100%/8   | Total:  1h 16m | Avg:  9m 34s | Max: 21m 25s
  🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 32m | Avg: 46m 27s | Max: 47m 13s | Hits: 191%/3688  
  🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 14m | Avg: 44m 42s | Max: 51m 58s | Hits: 249%/5532  
  🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 46m | Avg: 53m 25s | Max: 53m 48s
🟩 cxx_family
  🟩 Clang              Pass: 100%/14  | Total:  1h 43m | Avg:  7m 24s | Max: 33m 33s
  🟩 GCC                Pass: 100%/16  | Total:  2h 30m | Avg:  9m 24s | Max: 36m 02s
  🟩 MSVC               Pass: 100%/5   | Total:  3h 47m | Avg: 45m 24s | Max: 51m 58s | Hits: 226%/9220  
  🟩 NVHPC              Pass: 100%/2   | Total:  1h 46m | Avg: 53m 25s | Max: 53m 48s
🟩 gpu
  🟩 v100               Pass: 100%/37  | Total:  9h 48m | Avg: 15m 53s | Max: 53m 48s | Hits: 226%/9220  
🟩 jobs
  🟩 Build              Pass: 100%/31  | Total:  7h 44m | Avg: 14m 58s | Max: 53m 48s | Hits: 191%/7376  
  🟩 TestCPU            Pass: 100%/3   | Total: 50m 39s | Avg: 16m 53s | Max: 35m 09s | Hits: 365%/1844  
  🟩 TestGPU            Pass: 100%/3   | Total:  1h 13m | Avg: 24m 26s | Max: 33m 33s
🟩 sm
  🟩 90a                Pass: 100%/1   | Total:  4m 45s | Avg:  4m 45s | Max:  4m 45s
🟩 std
  🟩 17                 Pass: 100%/14  | Total:  4h 36m | Avg: 19m 45s | Max: 53m 02s | Hits: 191%/5532  
  🟩 20                 Pass: 100%/21  | Total:  4h 44m | Avg: 13m 31s | Max: 53m 48s | Hits: 278%/3688

🟩 cudax: Pass: 100%/20 | Total: 2h 03m | Avg: 6m 11s | Max: 20m 13s | Hits: 383%/522

🟩 cpu
  🟩 amd64              Pass: 100%/16  | Total:  1h 49m | Avg:  6m 50s | Max: 20m 13s | Hits: 383%/522   
  🟩 arm64              Pass: 100%/4   | Total: 14m 23s | Avg:  3m 35s | Max:  3m 39s
🟩 ctk
  🟩 12.0               Pass: 100%/1   | Total: 11m 17s | Avg: 11m 17s | Max: 11m 17s | Hits: 383%/261   
  🟩 12.5               Pass: 100%/2   | Total: 12m 29s | Avg:  6m 14s | Max:  6m 23s
  🟩 12.6               Pass: 100%/17  | Total:  1h 39m | Avg:  5m 52s | Max: 20m 13s | Hits: 383%/261   
🟩 cudacxx
  🟩 nvcc12.0           Pass: 100%/1   | Total: 11m 17s | Avg: 11m 17s | Max: 11m 17s | Hits: 383%/261   
  🟩 nvcc12.5           Pass: 100%/2   | Total: 12m 29s | Avg:  6m 14s | Max:  6m 23s
  🟩 nvcc12.6           Pass: 100%/17  | Total:  1h 39m | Avg:  5m 52s | Max: 20m 13s | Hits: 383%/261   
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/20  | Total:  2h 03m | Avg:  6m 11s | Max: 20m 13s | Hits: 383%/522   
🟩 cxx
  🟩 Clang14            Pass: 100%/1   | Total:  3m 54s | Avg:  3m 54s | Max:  3m 54s
  🟩 Clang15            Pass: 100%/1   | Total:  3m 58s | Avg:  3m 58s | Max:  3m 58s
  🟩 Clang16            Pass: 100%/1   | Total:  4m 00s | Avg:  4m 00s | Max:  4m 00s
  🟩 Clang17            Pass: 100%/1   | Total:  3m 49s | Avg:  3m 49s | Max:  3m 49s
  🟩 Clang18            Pass: 100%/4   | Total: 26m 51s | Avg:  6m 42s | Max: 15m 39s
  🟩 GCC10              Pass: 100%/1   | Total:  4m 06s | Avg:  4m 06s | Max:  4m 06s
  🟩 GCC11              Pass: 100%/1   | Total:  3m 51s | Avg:  3m 51s | Max:  3m 51s
  🟩 GCC12              Pass: 100%/2   | Total: 24m 15s | Avg: 12m 07s | Max: 20m 13s
  🟩 GCC13              Pass: 100%/4   | Total: 13m 33s | Avg:  3m 23s | Max:  3m 39s
  🟩 MSVC14.36          Pass: 100%/1   | Total: 11m 17s | Avg: 11m 17s | Max: 11m 17s | Hits: 383%/261   
  🟩 MSVC14.39          Pass: 100%/1   | Total: 11m 41s | Avg: 11m 41s | Max: 11m 41s | Hits: 383%/261   
  🟩 NVHPC24.7          Pass: 100%/2   | Total: 12m 29s | Avg:  6m 14s | Max:  6m 23s
🟩 cxx_family
  🟩 Clang              Pass: 100%/8   | Total: 42m 32s | Avg:  5m 19s | Max: 15m 39s
  🟩 GCC                Pass: 100%/8   | Total: 45m 45s | Avg:  5m 43s | Max: 20m 13s
  🟩 MSVC               Pass: 100%/2   | Total: 22m 58s | Avg: 11m 29s | Max: 11m 41s | Hits: 383%/522   
  🟩 NVHPC              Pass: 100%/2   | Total: 12m 29s | Avg:  6m 14s | Max:  6m 23s
🟩 gpu
  🟩 v100               Pass: 100%/20  | Total:  2h 03m | Avg:  6m 11s | Max: 20m 13s | Hits: 383%/522   
🟩 jobs
  🟩 Build              Pass: 100%/18  | Total:  1h 27m | Avg:  4m 52s | Max: 11m 41s | Hits: 383%/522   
  🟩 Test               Pass: 100%/2   | Total: 35m 52s | Avg: 17m 56s | Max: 20m 13s
🟩 sm
  🟩 90                 Pass: 100%/1   | Total:  3m 15s | Avg:  3m 15s | Max:  3m 15s
  🟩 90a                Pass: 100%/1   | Total:  3m 07s | Avg:  3m 07s | Max:  3m 07s
🟩 std
  🟩 17                 Pass: 100%/4   | Total: 16m 29s | Avg:  4m 07s | Max:  6m 06s
  🟩 20                 Pass: 100%/16  | Total:  1h 47m | Avg:  6m 42s | Max: 20m 13s | Hits: 383%/522

🟩 cccl: Pass: 100%/4 | Total: 20m 48s | Avg: 5m 12s | Max: 5m 35s

🟩 cpu
  🟩 amd64              Pass: 100%/4   | Total: 20m 48s | Avg:  5m 12s | Max:  5m 35s
🟩 ctk
  🟩 12.0               Pass: 100%/2   | Total: 10m 16s | Avg:  5m 08s | Max:  5m 11s
  🟩 12.6               Pass: 100%/2   | Total: 10m 32s | Avg:  5m 16s | Max:  5m 35s
🟩 cudacxx
  🟩 nvcc12.0           Pass: 100%/2   | Total: 10m 16s | Avg:  5m 08s | Max:  5m 11s
  🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 32s | Avg:  5m 16s | Max:  5m 35s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/4   | Total: 20m 48s | Avg:  5m 12s | Max:  5m 35s
🟩 cxx
  🟩 Clang14            Pass: 100%/1   | Total:  5m 11s | Avg:  5m 11s | Max:  5m 11s
  🟩 Clang18            Pass: 100%/1   | Total:  5m 35s | Avg:  5m 35s | Max:  5m 35s
  🟩 GCC12              Pass: 100%/1   | Total:  5m 05s | Avg:  5m 05s | Max:  5m 05s
  🟩 GCC13              Pass: 100%/1   | Total:  4m 57s | Avg:  4m 57s | Max:  4m 57s
🟩 cxx_family
  🟩 Clang              Pass: 100%/2   | Total: 10m 46s | Avg:  5m 23s | Max:  5m 35s
  🟩 GCC                Pass: 100%/2   | Total: 10m 02s | Avg:  5m 01s | Max:  5m 05s
🟩 gpu
  🟩 v100               Pass: 100%/4   | Total: 20m 48s | Avg:  5m 12s | Max:  5m 35s
🟩 jobs
  🟩 Infra              Pass: 100%/4   | Total: 20m 48s | Avg:  5m 12s | Max:  5m 35s

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 55s | Avg: 4m 57s | Max: 7m 38s

🟩 cpu
  🟩 amd64              Pass: 100%/2   | Total:  9m 55s | Avg:  4m 57s | Max:  7m 38s
🟩 ctk
  🟩 12.6               Pass: 100%/2   | Total:  9m 55s | Avg:  4m 57s | Max:  7m 38s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 55s | Avg:  4m 57s | Max:  7m 38s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/2   | Total:  9m 55s | Avg:  4m 57s | Max:  7m 38s
🟩 cxx
  🟩 GCC13              Pass: 100%/2   | Total:  9m 55s | Avg:  4m 57s | Max:  7m 38s
🟩 cxx_family
  🟩 GCC                Pass: 100%/2   | Total:  9m 55s | Avg:  4m 57s | Max:  7m 38s
🟩 gpu
  🟩 v100               Pass: 100%/2   | Total:  9m 55s | Avg:  4m 57s | Max:  7m 38s
🟩 jobs
  🟩 Build              Pass: 100%/1   | Total:  2m 17s | Avg:  2m 17s | Max:  2m 17s
  🟩 Test               Pass: 100%/1   | Total:  7m 38s | Avg:  7m 38s | Max:  7m 38s

🟩 python: Pass: 100%/1 | Total: 42m 45s | Avg: 42m 45s | Max: 42m 45s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 42m 45s | Avg: 42m 45s | Max: 42m 45s
🟩 ctk
  🟩 12.6               Pass: 100%/1   | Total: 42m 45s | Avg: 42m 45s | Max: 42m 45s
🟩 cudacxx
  🟩 nvcc12.6           Pass: 100%/1   | Total: 42m 45s | Avg: 42m 45s | Max: 42m 45s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 42m 45s | Avg: 42m 45s | Max: 42m 45s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 42m 45s | Avg: 42m 45s | Max: 42m 45s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 42m 45s | Avg: 42m 45s | Max: 42m 45s
🟩 gpu
  🟩 v100               Pass: 100%/1   | Total: 42m 45s | Avg: 42m 45s | Max: 42m 45s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 42m 45s | Avg: 42m 45s | Max: 42m 45s

👃 Inspect Changes

Modifications in project?

	Project
+/-	CCCL Infrastructure
	libcu++
	CUB
	Thrust
	CUDA Experimental
+/-	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
+/-	CCCL Infrastructure
+/-	libcu++
+/-	CUB
+/-	Thrust
+/-	CUDA Experimental
+/-	python
+/-	CCCL C Parallel Library
+/-	Catch2Helper

🏃‍ Runner counts (total jobs: 148)

#	Runner
98	`linux-amd64-cpu16`
23	`linux-amd64-gpu-v100-latest-1`
16	`windows-amd64-cpu16`
10	`linux-arm64-cpu16`
1	`linux-amd64-gpu-h100-latest-1-testing`

@shwina

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative
* Run `copy_cccl_headers_to_aude_include()` before `setup()`
* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.
* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel
* Bug fix: cuda/_include only exists after shutil.copytree() ran.
* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py
* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)
* Replace := operator (needs Python 3.8+)
* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md
* Restore original README.md: `pip3 install -e` now works on first pass.
* cuda_cccl/README.md: FOR INTERNAL USE ONLY
* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment))
  Command used: ci/update_version.sh 2 8 0
* Modernize pyproject.toml, setup.py
  Trigger for this change:
  * NVIDIA#3201 (comment)
  * NVIDIA#3201 (comment)
* Install CCCL headers under cuda.cccl.include
  Trigger for this change:
  * NVIDIA#3201 (comment)
  Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.
* Factor out cuda_cccl/cuda/cccl/include_paths.py
* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative
* Add missing Copyright notice.
* Add missing __init__.py (cuda.cccl)
* Add `"cuda.cccl"` to `autodoc.mock_imports`
* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)
* Add # TODO: move this to a module-level import
* Modernize cuda_cooperative/pyproject.toml, setup.py
* Convert cuda_cooperative to use hatchling as build backend.
* Revert "Convert cuda_cooperative to use hatchling as build backend."
  This reverts commit 61637d6.
* Move numpy from [build-system] requires -> [project] dependencies
* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH
* Remove copy_license() and use license_files=["../../LICENSE"] instead.
* Further modernize cuda_cccl/setup.py to use pathlib
* Trivial simplifications in cuda_cccl/pyproject.toml
* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code
* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml
* Add taplo-pre-commit to .pre-commit-config.yaml
* taplo-pre-commit auto-fixes
* Use pathlib in cuda_cooperative/setup.py
* CCCL_PYTHON_PATH in cuda_cooperative/setup.py
* Modernize cuda_parallel/pyproject.toml, setup.py
* Use pathlib in cuda_parallel/setup.py
* Add `# TOML lint & format` comment.
* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml
* Use pathlib in cuda/cccl/include_paths.py
* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)
* Fixes after git merge main
* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
      ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`
* Introduce cuda_cooperative/constraints.txt
* Also add cuda_parallel/constraints.txt
* Add `--constraint constraints.txt` in ci/test_python.sh
* Update Copyright dates
* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)
  For completeness: The other repo took a long time to install into the pre-commit cache; so long it lead to timeouts in the CCCL CI.
* Remove unused cuda_parallel jinja2 dependency (noticed by chance).
* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.
* Make cuda_cooperative, cuda_parallel testing completely independent.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"
  This reverts commit ea33a21.
  Error message: NVIDIA#3201 (comment)
* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Restore original ci/matrix.yaml [skip-rapids]
* Use for loop in test_python.sh to avoid code duplication.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]
* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"
  This reverts commit ec206fd.
* Implement suggestion by @shwina (NVIDIA#3201 (review))
* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

implement `add_sat` split `signed`/`unsigned` implementation, improve implementation for MSVC improve device `add_sat` implementation add `add_sat` test improve generic `add_sat` implementation for signed types implement `sub_sat` allow more msvc intrinsics on x86 add op tests partially implement `mul_sat` implement `div_sat` and `saturate_cast` add `saturate_cast` test simplify `div_sat` test Deprectate C++11 and C++14 for libcu++ (#3173) * Deprectate C++11 and C++14 for libcu++ Co-authored-by: Bernhard Manfred Gruber <[email protected]> Implement `abs` and `div` from `cstdlib` (#3153) * implement integer abs functions * improve tests, fix constexpr support * just use the our implementation * implement `cuda::std::div` * prefer host's `div_t` like types * provide `cuda::std::abs` overloads for floats * allow fp abs for NVRTC * silence msvc's warning about conversion from floating point to integral Fix missing radix sort policies (#3174) Fixes NVBug 5009941 Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148) * introduces new arg{min,max} interface with two output iterators * adds fp inf tests * fixes docs * improves code example * fixes exec space specifier * trying to fix deprecation warning for more compilers * inlines unzip operator * trying to fix deprecation warning for nvhpc * integrates supression fixes in diagnostics * pre-ctk 11.5 deprecation suppression * fixes icc * fix for pre-ctk11.5 * cleans up deprecation suppression * cleanup Extend tuning documentation (#3179) Add codespell pre-commit hook, fix typos in CCCL (#3168) * Add codespell pre-commit hook * Automatic changes from codespell. * Manual changes. 
Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)
* Adds specialized local cuda atomics and injects them into most atomics paths.
Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>
* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478
* Remove extraneous double brackets in unformatted code.
* Merge unsafe atomic logic into `__cuda_is_local`.
* Use `const_cast` for type conversions in cuda_local.h
* Fix build issues from interface changes
* Fix missing __nanosleep on sm70-
* Guard __isLocal from NVHPC
* Use PTX instead of running nothing from NVHPC
* fixup /s/nvrtc/nvhpc
* Fixup missing CUDA ifdef surrounding device code
* Fix codegen
* Bypass some sort of compiler bug on GCC7
* Apply suggestions from code review
* Use unsafe automatic storage atomics in codegen tests
---------
Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>

Refactor the source code layout for `cuda.parallel` (#3177)
* Refactor the source layout for cuda.parallel
* Add copyright
* Address review feedback
* Don't import anything into `experimental` namespace
* fix import
---------
Co-authored-by: Ashwin Srinath <[email protected]>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)
* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses
Fixes: #3053

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)
* Simple jacobi example with parallel for and reductions
* clang-format
* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)
* memcpy_async_tx: Fix bug in test
  Two bugs, one of which occurs in practice:
  1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice)
  2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. (Has not yet manifested itself)
* cp_async_bulk_tensor: Pre-emptively fence more in test

Add type annotations and mypy checks for `cuda.parallel` (#3180)
* Refactor the source layout for cuda.parallel
* Add initial type annotations
* Update pre-commit config
* More typing
* Fix bad merge
* Fix TYPE_CHECKING and numpy annotations
* typing bindings.py correctly
* Address review feedback
---------
Co-authored-by: Ashwin Srinath <[email protected]>

Fix rendering of cuda.parallel docs (#3192)
* Fix pre-commit config for codespell and remaining typos
* Fix rendering of docs for cuda.parallel
---------
Co-authored-by: Ashwin Srinath <[email protected]>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)
The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC. This commit enables PDL when launching the kernel.
Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)
* adds benchmarks for reduce::arg{min,max}
* preliminary streaming arg-extremum reduction
* fixes implicit conversion
* uses streaming dispatch class
* changes arg benches to use new streaming reduce
* streaming arg-extrema reduction
* fixes style
* fixes compilation failures
* cleanups
* adds rst style comments
* declare vars const and use clamp
* consolidates argmin argmax benchmarks
* fixes thrust usage
* drops offset type in arg-extrema benchmarks
* fixes clang cuda
* exec space macros
* switch to signed global offset type for slightly better perf
* clarifies documentation
* applies minor benchmark style changes from review comments
* fixes interface documentation and comments
* list-init accumulating output op
* improves style, comments, and tests
* cleans up aggregate init
* renames dispatch class usage in benchmarks
* fixes merge conflicts
* addresses review comments
* addresses review comments
* fixes assertion
* removes superseded implementation
* changes large problem tests to use new interface
* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)
Co-authored-by: Ashwin Srinath <[email protected]>

Adds support for large number of items to `DeviceTransform` (#3172)
* moves large problem test helper to common file
* adds support for large num items to device transform
* adds tests for large number of items to device interface
* fixes format
* addresses review comments

cp_async_bulk: Fix test (#3198)
* memcpy_async_tx: Fix bug in test
  Two bugs, one of which occurs in practice:
  1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice)
  2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. (Has not yet manifested itself)
* cp_async_bulk_tensor: Pre-emptively fence more in test
* cp_async_bulk: Fix test
  The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)
* Fix: make launchers a CUB detail; make kernel source functions hidden.
* [pre-commit.ci] auto code formatting
* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)
* unify macros and cmake options that control the suppression of deprecation warnings
* suppress nvcc warning #186 in thrust header tests
* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fix thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)
* Define __eq__ and __hash__ for Iterators
* Define cache_with_key utility and use it to cache Reduce objects
* Add tests for caching Reduce objects
* Tighten up types
* Updates to support 3.7
* Address review feedback
* Introduce IteratorKind to hold iterator type information
* Use the .kind to generate an abi_name
* Remove __eq__ and __hash__ methods from IteratorBase
* Move helper function
* Formatting
* Don't unpack tuple in cache key
---------
Co-authored-by: Ashwin Srinath <[email protected]>

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)
fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)
* Update pyproject.tomls for import sorting
* Update files after running pre-commit
* Move ruff config to pyproject.toml
---------
Co-authored-by: Ashwin Srinath <[email protected]>

fix tuning_scan sm90 config issue (#3236)
Co-authored-by: Shijie Chen <[email protected]>

[STF] Logical token (#3196)
* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.
* Add missing files
* Check if a task implementation can match a prototype where the void_interface arguments are ignored
* Implement ctx.abstract_logical_data() which relies on a void data interface
* Illustrate how to use abstract handles in local contexts
* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages
* Small improvements in the examples
* Do not try to allocate or move void data
* Do not use I as a variable
* fix linkage error
* rename abstract_logical_data into logical_token
* Document logical token
* fix spelling error
* fix sphinx error
* reflect name changes
* use meaningful variable names
* simplify logical_token implementation because writeback is already disabled
* add a unit test for token elision
* implement token elision in host_launch
* Remove unused type
* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens
* Much simpler is_tuple_invocable_with_filtered implementation
* Fix buggy test
* Factorize code
* Document that we can ignore tokens for task and host_launch
* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)
* Forbid non-contiguous arrays as inputs (or outputs)
* Implement a more robust way to check for contiguity
* Don't bother if cublas unavailable
* Fix how we check for zero-element arrays
* sort imports
---------
Co-authored-by: Ashwin Srinath <[email protected]>

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects (#3253)
* Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects
* Do not add option twice

ptx: Add add_instruction.py (#3190)
This file helps create the necessary structure for new PTX instructions.
Co-authored-by: Allard Hendriksen <[email protected]>

Bump main to 2.9.0. (#3247)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)
Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)
Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)
Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)
Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)
Fixes: #1389

Require at least gcc7 (#3268)
Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)
Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)
* Add a test to reproduce a bug observed with parallel_for on a host place
* clang-format
* use _CCCL_ASSERT
* Attempt to debug
* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead
* fix lambda expression
* clang-format

Enable thrust::identity test for non-MSVC (#3281)
This seems to be an oversight when the test was added
Co-authored-by: Michael Schellenberger Costa <[email protected]>

Enable PDL in triple chevron launch (#3282)
It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed to _CCCL_HAS_PDL during the review introducing the feature.
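
The commit message above does not say how the rename disabled PDL. One plausible mechanism, sketched below as an assumption, is a preprocessor guard that still tests the old macro name: an identifier that is not defined as a macro evaluates to 0 inside `#if`, so the guarded code is dropped without any diagnostic by default. The macro and function names are hypothetical stand-ins, not the CCCL ones.

```cpp
#include <cassert>

// Hypothetical macro names: they stand in for the old and new spelling of one
// feature-test macro. After the rename only the new spelling is defined.
#define NEW_HAS_FEATURE 1

// A guard that was not updated and still tests the old spelling.
// OLD_HAS_FEATURE is not defined, so the preprocessor treats it as 0.
constexpr bool stale_guard_enabled()
{
#if OLD_HAS_FEATURE
  return true;
#else
  return false; // the feature is compiled out without an error
#endif
}

// The same guard after being updated to the new spelling.
constexpr bool updated_guard_enabled()
{
#if NEW_HAS_FEATURE
  return true;
#else
  return false;
#endif
}

static_assert(!stale_guard_enabled(), "stale guard silently disables the feature");
static_assert(updated_guard_enabled(), "updated guard keeps the feature enabled");
```

Compilers can flag this pattern when asked (for example GCC's and Clang's `-Wundef`), which is one way to catch a partially applied macro rename.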
Disambiguate line continuations and macro continuations in `<nv/target>` (#3244)

Drop VS 2017 from CI (#3287)
Fixes: #3286

Drop ICC support in code (#3279)
* Drop ICC from code
Fixes: #3278
Co-authored-by: Michael Schellenberger Costa <[email protected]>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template as in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)
Try improving build times by using process isolation instead of hyper-v
Co-authored-by: Michael Schellenberger Costa <[email protected]>

[pre-commit.ci] pre-commit autoupdate (#3248)
* [pre-commit.ci] pre-commit autoupdate
  updates:
  - [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
  - [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
  - [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)
Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop Thrust legacy arch macros (#3298)
Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)
* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
  That is the only combination for which rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add escape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
  cudaLaunchKernelEx requires C++11, but unfortunately `<cuda_runtime.h>` checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers
Fixes: #3249
Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop CUB's util_compiler.cuh (#3302)
All contained macros were deprecated

Update packman and repo_docs versions (#3293)
Co-authored-by: Ashwin Srinath <[email protected]>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)
* adds support for large number of items to three-way partition
* adapts interface to use choose_signed_offset_t
* integrates applicable feedback from device-select pr
* changes behavior for empty problems
* unifies grid constant macro
* fixes kernel template specialization mismatch
* integrates _CCCL_GRID_CONSTANT changes
* resolve merge conflicts
* fixes checks in test
* fixes test verification
* improves tests
* makes few improvements to streaming dispatch
* improves code comment on test
* fixes unrelated compiler error
* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)
* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal
Fixes: #3252

Implement `views::empty` (#3254)
* Disable pair conversion of subrange with clang in C++17
* Fix namespace views
* Implement `views::empty`
  This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view

Refactor `limits` and `climits` (#3221)
* implement builtins for huge val, nan and nans
* change `INFINITY` and `NAN` implementation for NVRTC

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)
* Add tests demonstrating usage of different iterators
* Update documentation of reduce_into by merging import code snippet with the rest of the example
* Add documentation for current iterators
* Run pre-commit checks and update accordingly
* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)
Co-authored-by: Bernhard Manfred Gruber <[email protected]>

[STF] Cleanup task dependencies object constructors (#3291)
* Define tag types for access modes
* - Rework how we build task_dep objects based on access mode tags
  - pack_state is now responsible for using a const_cast for read only data
* Greatly simplify the previous attempt: do not define new types, but use integral constants based on the enums
* It seems the const_cast was not necessary, so we can simplify it and not even do some dispatch based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)
* Define ctx.pick_stream() which was missing for the unified context
* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)
Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)
PR #3302 planned to drop the file, but only dropped its content. This was an oversight. So let's drop the entire file.
Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)
Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)
all_of always returns true on an empty range.

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)
* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when the CUDA graph completes. Therefore keeping track of "dangling events" is a waste of time and resources.
* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code
* When not storing the dangling events, we must still perform the deinit operations that were producing these events!

Extract scan kernels into NVRTC-compilable header (#3334)
* Extract scan kernels into NVRTC-compilable header
* Update cub/cub/device/dispatch/dispatch_scan.cuh
Co-authored-by: Georgii Evtushenko <[email protected]>
---------
Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: Georgii Evtushenko <[email protected]>

Drop deprecated aliases in Thrust functional (#3272)
Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)
* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)
Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)
* fixes segment offset generation
* switches to analytical verification
* switches to analytical verification for pairs
* fixes spelling
* adds tests for large number of segments
* fixes narrowing conversion in tests
* addresses review comments
* fixes includes

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)
* fixes segment offset generation
* switches to analytical verification
* switches to analytical verification for pairs
* addresses review comments
* introduces segment offset type
* adds tests for large number of segments
* adds support for large number of segments
* drops segment offset type
* fixes thrust namespace
* removes about-to-be-deprecated cub iterators
* no exec specifier on defaulted ctor
* fixes gcc7 linker error
* uses local_segment_index_t throughout
* determine offset type based on type returned by segment iterator begin/end iterators
* minor style improvements

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)
* Introduce gpu_struct decorator and typing
* Enable `reduce` to accept arrays of structs as inputs
* Add test for reducing arrays-of-struct
* Update documentation
* Use a numpy array rather than ctypes object
* Change zeros -> empty for output array and temp storage
* Add a TODO for typing GpuStruct
* Documentation updates
* Remove test_reduce_struct_type from test_reduce.py
* Revert to `to_cccl_value()` accepting ndarray + GpuStruct
* Bump copyrights
---------
Co-authored-by: Ashwin Srinath <[email protected]>

Deprecate thrust::async (#3324)
Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)
* add compiler-specific path
* fix device code path
* add _CCC_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)
Fixes: #3306

Upgrade to Catch2 3.8 (#3310)
Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)
Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Update CODEOWNERS (#3331)
* Update CODEOWNERS
* Update CODEOWNERS
* Update CODEOWNERS
* [pre-commit.ci] auto code formatting
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)
* Implement more cmath functions to be usable on host and device
* Implement math roots functions
* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (#3394)
* Redefine and deprecate thrust::remove_cvref
Co-authored-by: Michael Schellenberger Costa <[email protected]>

Fix assert definition for NVHPC due to constexpr issues (#3418)
NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it. Fix this by always using the host definition which should also work on device.
Fixes #3411

Extend CUB reduce benchmarks (#3401)
* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters
Fixes: #3283

Update upload-pages-artifact to v3 (#3423)
* Update upload-pages-artifact to v3
* Empty commit
---------
Co-authored-by: Ashwin Srinath <[email protected]>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)

[FEA]: Introduce Python module with CCCL headers (#3201)
* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative
* Run `copy_cccl_headers_to_aude_include()` before `setup()`
* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.
* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel
* Bug fix: cuda/_include only exists after shutil.copytree() ran.
* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py
* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)
* Replace := operator (needs Python 3.8+)
* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md
* Restore original README.md: `pip3 install -e` now works on first pass.
* cuda_cccl/README.md: FOR INTERNAL USE ONLY
* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)
  Command used: ci/update_version.sh 2 8 0
* Modernize pyproject.toml, setup.py
  Trigger for this change:
  * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178
  * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996
* Install CCCL headers under cuda.cccl.include
  Trigger for this change:
  * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562
  Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.
* Factor out cuda_cccl/cuda/cccl/include_paths.py
* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative
* Add missing Copyright notice.
* Add missing __init__.py (cuda.cccl)
* Add `"cuda.cccl"` to `autodoc.mock_imports`
* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)
* Add # TODO: move this to a module-level import
* Modernize cuda_cooperative/pyproject.toml, setup.py
* Convert cuda_cooperative to use hatchling as build backend.
* Revert "Convert cuda_cooperative to use hatchling as build backend."
  This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.
* Move numpy from [build-system] requires -> [project] dependencies
* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH
* Remove copy_license() and use license_files=["../../LICENSE"] instead.
* Further modernize cuda_cccl/setup.py to use pathlib
* Trivial simplifications in cuda_cccl/pyproject.toml
* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code
* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml
* Add taplo-pre-commit to .pre-commit-config.yaml
* taplo-pre-commit auto-fixes
* Use pathlib in cuda_cooperative/setup.py
* CCCL_PYTHON_PATH in cuda_cooperative/setup.py
* Modernize cuda_parallel/pyproject.toml, setup.py
* Use pathlib in cuda_parallel/setup.py
* Add `# TOML lint & format` comment.
* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml
* Use pathlib in cuda/cccl/include_paths.py
* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)
* Fixes after git merge main
* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'
```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```
* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`
* Introduce cuda_cooperative/constraints.txt
* Also add cuda_parallel/constraints.txt
* Add `--constraint constraints.txt` in ci/test_python.sh
* Update Copyright dates
* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)
  For completeness: The other repo took a long time to install into the pre-commit cache; so long that it led to timeouts in the CCCL CI.
* Remove unused cuda_parallel jinja2 dependency (noticed by chance).
* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.
* Make cuda_cooperative, cuda_parallel testing completely independent.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"
  This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.
  Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971
* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Restore original ci/matrix.yaml [skip-rapids]
* Use for loop in test_python.sh to avoid code duplication.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]
* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"
  This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.
* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)
* Address feedback by @leofang
---------
Co-authored-by: Bernhard Manfred Gruber <[email protected]>

cuda.parallel: Add optional stream argument to reduce_into() (#3348)
* Add optional stream argument to reduce_into()
* Add tests to check for reduce_into() stream behavior
* Move protocol related utils to separate file and rework __cuda_stream__ error messages
* Fix synchronization issue in stream test and add one more invalid stream test case
* Rename cuda stream validation function after removing leading underscore
* Unpack values from __cuda_stream__ instead of indexing
* Fix linting errors
* Handle TypeError when unpacking invalid __cuda_stream__ return
* Use stream to allocate cupy memory in new stream test

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)
* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++
Fixes #3404

move to c++17, finalize device optimization
fix msvc compilation, update tests

Deprecate C++11 and C++14 for libcu++ (#3173)
* Deprecate C++11 and C++14 for libcu++
Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Implement `abs` and `div` from `cstdlib` (#3153)
* implement integer abs functions
* improve tests, fix constexpr support
* just use our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral

Fix missing radix sort policies (#3174)
Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)
* introduces new arg{min,max} interface with two output iterators
* adds fp inf tests
* fixes docs
* improves code example
* fixes exec space specifier
* trying to fix deprecation warning for more compilers
* inlines unzip operator
* trying to fix deprecation warning for nvhpc
* integrates suppression fixes in diagnostics
* pre-ctk 11.5 deprecation suppression
* fixes icc
* fix for pre-ctk11.5
* cleans up deprecation suppression
* cleanup

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)
* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)
* Adds specialized local cuda atomics and injects them into most atomics paths.
Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>
* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478
* Remove extraneous double brackets in unformatted code.
* Merge unsafe atomic logic into `__cuda_is_local`.
* Use `const_cast` for type conversions in cuda_local.h
* Fix build issues from interface changes
* Fix missing __nanosleep on sm70-
* Guard __isLocal from NVHPC
* Use PTX instead of running nothing from NVHPC
* fixup /s/nvrtc/nvhpc
* Fixup missing CUDA ifdef surrounding device code
* Fix codegen
* Bypass some sort of compiler bug on GCC7
* Apply suggestions from code review
* Use unsafe automatic storage atomics in codegen tests
---------
Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>

Refactor the source code layout for `cuda.parallel` (#3177)
* Refactor the source layout for cuda.parallel
* Add copyright
* Address review feedback
* Don't import anything into `experimental` namespace
* fix import
---------
Co-authored-by: Ashwin Srinath <[email protected]>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)
* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses
Fixes: #3053

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)
* Simple jacobi example with parallel for and reductions
* clang-format
* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)
* memcpy_async_tx: Fix bug in test
  Two bugs, one of which occurs in practice:
  1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice)
  2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. (Has not yet manifested itself)
* cp_async_bulk_tensor: Pre-emptively fence more in test

Add type annotations and mypy checks for `cuda.parallel` (#3180)
* Refactor the source layout for cuda.parallel
* Add initial type annotations
* Update pre-commit config
* More typing
* Fix bad merge
* Fix TYPE_CHECKING and numpy annotations
* typing bindings.py correctly
* Address review feedback
---------
Co-authored-by: Ashwin Srinath <[email protected]>

Fix rendering of cuda.parallel docs (#3192)
* Fix pre-commit config for codespell and remaining typos
* Fix rendering of docs for cuda.parallel
---------
Co-authored-by: Ashwin Srinath <[email protected]>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)
The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC. This commit enables PDL when launching the kernel.
Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647) * adds benchmarks for reduce::arg{min,max} * preliminary streaming arg-extremum reduction * fixes implicit conversion * uses streaming dispatch class * changes arg benches to use new streaming reduce * streaming arg-extrema reduction * fixes style * fixes compilation failures * cleanups * adds rst style comments * declare vars const and use clamp * consolidates argmin argmax benchmarks * fixes thrust usage * drops offset type in arg-extrema benchmarks * fixes clang cuda * exec space macros * switch to signed global offset type for slightly better perf * clarifies documentation * applies minor benchmark style changes from review comments * fixes interface documentation and comments * list-init accumulating output op * improves style, comments, and tests * cleans up aggregate init * renames dispatch class usage in benchmarks * fixes merge conflicts * addresses review comments * addresses review comments * fixes assertion * removes superseded implementation * changes large problem tests to use new interface * removes obsolete tests for deprecated interface Fixes for Python 3.7 docs environment (#3206) Co-authored-by: Ashwin Srinath <[email protected]> Adds support for large number of items to `DeviceTransform` (#3172) * moves large problem test helper to common file * adds support for large num items to device transform * adds tests for large number of items to device interface * fixes format * addresses review comments cp_async_bulk: Fix test (#3198) * memcpy_async_tx: Fix bug in test Two bugs, one of which occurs in practice: 1. There is a missing fence.proxy.space::global between the writes to global memory and the memcpy_async_tx. (Occurs in practice) 2. The end of the kernel should be fenced with `__syncthreads()`, because the barrier is invalidated in the destructor. If other threads are still waiting on it, there will be UB. 
(Has not yet manifested itself) * cp_async_bulk_tensor: Pre-emptively fence more in test * cp_async_bulk: Fix test The global memory pointer could be misaligned. cudax fixes for msvc 14.41 (#3200) avoid instantiating class templates in `is_same` implementation when possible (#3203) Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209) * Fix: make launchers a CUB detail; make kernel source functions hidden. * [pre-commit.ci] auto code formatting * Address review comments, fix which macro gets fixed. help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202) unify macros and cmake options that control the suppression of deprecation warnings (#3220) * unify macros and cmake options that control the suppression of deprecation warnings * suppress nvcc warning #186 in thrust header tests * suppress c++ dialect deprecation warnings in libcudacxx header tests Fx thread-reduce performance regression (#3225) cuda.parallel: In-memory caching of build objects (#3216) * Define __eq__ and __hash__ for Iterators * Define cache_with_key utility and use it to cache Reduce objects * Add tests for caching Reduce objects * Tighten up types * Updates to support 3.7 * Address review feedback * Introduce IteratorKind to hold iterator type information * Use the .kind to generate an abi_name * Remove __eq__ and __hash__ methods from IteratorBase * Move helper function * Formatting * Don't unpack tuple in cache key --------- Co-authored-by: Ashwin Srinath <[email protected]> Just enough ranges for c++14 `span` (#3211) use generalized concepts portability macros to simplify the `range` concept (#3217) fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR` Use Ruff to sort imports (#3230) * Update pyproject.tomls for import sorting * Update files after running pre-commit * Move ruff config to pyproject.toml --------- Co-authored-by: Ashwin Srinath <[email protected]> fix 
tuning_scan sm90 config issue (#3236) Co-authored-by: Shijie Chen <[email protected]> [STF] Logical token (#3196) * Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs. * Add missing files * Check if a task implementation can match a prototype where the void_interface arguments are ignored * Implement ctx.abstract_logical_data() which relies on a void data interface * Illustrate how to use abstract handles in local contexts * Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages * Small improvements in the examples * Do not try to allocate or move void data * Do not use I as a variable * fix linkage error * rename abtract_logical_data into logical_token * Document logical token * fix spelling error * fix sphinx error * reflect name changes * use meaningful variable names * simplify logical_token implementation because writeback is already disabled * add a unit test for token elision * implement token elision in host_launch * Remove unused type * Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens * Much simpler is_tuple_invocable_with_filtered implementation * Fix buggy test * Factorize code * Document that we can ignore tokens for task and host_launch * Documentation for logical data freeze Fix ReduceByKey tuning (#3240) Fix RLE tuning (#3239) cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233) * Forbid non-contiguous arrays as inputs (or outputs) * Implement a more robust way to check for contiguity * Don't bother if cublas unavailable * Fix how we check for zero-element arrays * sort imports --------- Co-authored-by: Ashwin Srinath <[email protected]> expands support for more offset types in segmented benchmark (#3231) Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253) * Add 
escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects * Do not add option twice ptx: Add add_instruction.py (#3190) This file helps create the necessary structure for new PTX instructions. Co-authored-by: Allard Hendriksen <[email protected]> Bump main to 2.9.0. (#3247) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop cub::Mutex (#3251) Fixes: #3250 Remove legacy macros from CUB util_arch.cuh (#3257) Fixes: #3256 Remove thrust::[unary|binary]_traits (#3260) Fixes: #3259 Architecture and OS identification macros (#3237) Bump main to 3.0.0. (#3265) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Drop thrust not1 and not2 (#3264) Fixes: #3263 CCCL Internal macro documentation (#3238) Deprecate GridBarrier and GridBarrierLifetime (#3258) Fixes: #1389 Require at least gcc7 (#3268) Fixes: #3267 Drop thrust::[unary|binary]_function (#3274) Fixes: #3273 Drop ICC from CI (#3277) [STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270) * Add a test to reproduce a bug observed with parallel_for on a host place * clang-format * use _CCCL_ASSERT * Attempt to debug * do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead * fix lambda expression * clang-format Enable thrust::identity test for non-MSVC (#3281) This seems to be an oversight when the test was added Co-authored-by: Michael Schellenberger Costa <[email protected]> Enable PDL in triple chevron launch (#3282) It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed to _CCCL_HAS_PDL during the review introducing the feature. 
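For readers skimming the log: the #3216 entry above ("In-memory caching of build objects") mentions a `cache_with_key` utility used to cache Reduce objects. Only the name appears in the log, so the sketch below is an assumption about its general shape and not the cuda.parallel implementation. The decorator signature, `make_key`, and `build_reduce` are hypothetical names chosen for illustration.

```python
import functools


def cache_with_key(key):
    """Cache results of a function, keyed by key(*args, **kwargs).

    Hypothetical sketch; the real helper in cuda.parallel may differ.
    """

    def decorator(func):
        cache = {}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            cache_key = key(*args, **kwargs)
            if cache_key not in cache:
                # Only run the expensive function on a cache miss.
                cache[cache_key] = func(*args, **kwargs)
            return cache[cache_key]

        return wrapper

    return decorator


build_calls = []  # records each real "build", for illustration only


def make_key(dtype_name, op_name):
    # A hashable key built only from the properties that affect the build.
    return (dtype_name, op_name)


@cache_with_key(make_key)
def build_reduce(dtype_name, op_name):
    # Stand-in for an expensive compile step.
    build_calls.append((dtype_name, op_name))
    return object()


first = build_reduce("int32", "add")
second = build_reduce("int32", "add")    # served from the cache
third = build_reduce("float64", "add")   # different key, new build
```

Passing an explicit key function instead of hashing the arguments directly fits the later bullets in that entry (dropping `__eq__`/`__hash__` from `IteratorBase` in favor of an `IteratorKind`), since the key can then be derived from type information only.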
Disambiguate line continuations and macro continuations in <nv/target> (#3244) Drop VS 2017 from CI (#3287) Fixes: #3286 Drop ICC support in code (#3279) * Drop ICC from code Fixes: #3278 Co-authored-by: Michael Schellenberger Costa <[email protected]> Make CUB NVRTC commandline arguments come from a cmake template (#3292) Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template than in the feature request template (#3295) Use process isolation instead of default hyper-v for Windows. (#3294) Try improving build times by using process isolation instead of hyper-v Co-authored-by: Michael Schellenberger Costa <[email protected]> [pre-commit.ci] pre-commit autoupdate (#3248) * [pre-commit.ci] pre-commit autoupdate updates: - [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6) - [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6) - [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1) Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop Thrust legacy arch macros (#3298) Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS Drop Thrust's compiler_fence.h (#3300) Drop CTK 11.x from CI (#3275) * Add cuda12.0-gcc7 devcontainer * Move MSVC2017 jobs to CTK 12.6 Those is the only combination where rapidsai has devcontainers * Add /Zc:__cplusplus for the libcudacxx tests * Only add excape hatch for affected CTKs * Workaround missing cudaLaunchKernelEx on MSVC cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSV_VER. 
As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK. * Workaround nvcc+MSVC issue * Regenerate devcontainers Fixes: #3249 Co-authored-by: Michael Schellenberger Costa <[email protected]> Update packman and repo_docs versions (#3293) Co-authored-by: Ashwin Srinath <[email protected]> Drop Thrust's deprecated compiler macros (#3301) Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305) Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506) * adds support for large number of items to three-way partition * adapts interface to use choose_signed_offset_t * integrates applicable feedback from device-select pr * changes behavior for empty problems * unifies grid constant macro * fixes kernel template specialization mismatch * integrates _CCCL_GRID_CONSTANT changes * resolve merge conflicts * fixes checks in test * fixes test verification * improves tests * makes few improvements to streaming dispatch * improves code comment on test * fixes unrelated compiler error * minor style improvements Refactor scan tunings (#3262) Require C++17 for compiling Thrust and CUB (#3255) * Issue an unsuppressable warning when compiling with < C++17 * Remove C++11/14 presets * Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers * Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14] * Remove CUB_ENABLE_DIALECT_CPP[11|14] * Update CI runs * Remove C++11/14 CI runs for CUB and Thrust * Raise compiler minimum versions for C++17 * Update ReadMe * Drop Thrust's cpp14_required.h * Add escape hatch for C++17 removal Fixes: #3252 Implement `views::empty` (#3254) * Disable pair conversion of subrange with clang in C++17 * Fix namespace views * Implement `views::empty` This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view Refactor `limits` and `climits` (#3221) * implement builtins for huge val, nan and nans * change `INFINITY` and `NAN` 
implementation for NVRTC cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311) * Add tests demonstrating usage of different iterators * Update documentation of reduce_into by merging import code snippet with the rest of the example * Add documentation for current iterators * Run pre-commit checks and update accordingly * Fix comments to refer to the proper lines in the code snippets in the docs Drop clang<14 from CI, update devcontainers. (#3309) Co-authored-by: Bernhard Manfred Gruber <[email protected]> [STF] Cleanup task dependencies object constructors (#3291) * Define tag types for access modes * - Rework how we build task_dep objects based on access mode tags - pack_state is now responsible for using a const_cast for read only data * Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums * It seems the const_cast was not necessarily so we can simplify it and not even do some dispatch based on access modes Disable test with a gcc-14 regression (#3297) Deprecate Thrust's cpp_compatibility.h macros (#3299) Remove dropped function objects from docs (#3319) Document `NV_TARGET` macros (#3313) [STF] Define ctx.pick_stream() which was missing for the unified context (#3326) * Define ctx.pick_stream() which was missing for the unified context * clang-format Deprecate cub::IterateThreadStore (#3337) Drop CUB's BinaryFlip operator (#3332) Deprecate cub::Swap (#3333) Clarify transform output can overlap input (#3323) Drop CUB APIs with a debug_synchronous parameter (#3330) Fixes: #3329 Drop CUB's util_compiler.cuh for real (#3340) PR #3302 planned to drop the file, but only dropped its content. This was an oversight. So let's drop the entire file. 
Drop cub::ValueCache (#3346) limits offset types for merge sort (#3328) Drop CDPv1 (#3344) Fixes: #3341 Drop thrust::void_t (#3362) Use cuda::std::addressof in Thrust (#3363) Fix all_of documentation for empty ranges (#3358) all_of always returns true on an empty range. [STF] Do not keep track of dangling events in a CUDA graph backend (#3327) * Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when the CUDA graph completes. Therefore keeping track of "dangling events" is a waste of time and resources. * replace can_ignore_dangling_events by track_dangling_events which leads to more readable code * When not storing the dangling events, we must still perform the deinit operations that were producing these events ! Extract scan kernels into NVRTC-compilable header (#3334) * Extract scan kernels into NVRTC-compilable header * Update cub/cub/device/dispatch/dispatch_scan.cuh Co-authored-by: Georgii Evtushenko <[email protected]> --------- Co-authored-by: Ashwin Srinath <[email protected]> Co-authored-by: Georgii Evtushenko <[email protected]> Drop deprecated aliases in Thrust functional (#3272) Fixes: #3271 Drop cub::DivideAndRoundUp (#3347) Use cuda::std::min/max in Thrust (#3364) Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361) * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` Cleanup util_arch (#2773) Deprecate thrust::null_type (#3367) Deprecate cub::DeviceSpmv (#3320) Fixes: #896 Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Compile basic infra test with C++17 (#3377) Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308) * fixes segment 
offset generation * switches to analytical verification * switches to analytical verification for pairs * addresses review comments * introduces segment offset type * adds tests for large number of segments * adds support for large number of segments * drops segment offset type * fixes thrust namespace * removes about-to-be-deprecated cub iterators * no exec specifier on defaulted ctor * fixes gcc7 linker error * uses local_segment_index_t throughout * determine offset type based on type returned by segment iterator begin/end iterators * minor style improvements Exit with error when RAPIDS CI fails. (#3385) cuda.parallel: Support structured types as algorithm inputs (#3218) * Introduce gpu_struct decorator and typing * Enable `reduce` to accept arrays of structs as inputs * Add test for reducing arrays-of-struct * Update documentation * Use a numpy array rather than ctypes object * Change zeros -> empty for output array and temp storage * Add a TODO for typing GpuStruct * Documentation udpates * Remove test_reduce_struct_type from test_reduce.py * Revert to `to_cccl_value()` accepting ndarray + GpuStruct * Bump copyrights --------- Co-authored-by: Ashwin Srinath <[email protected]> Deprecate thrust::async (#3324) Fixes: #100 Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342) Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314) * add compiler-specific path * fix device code path * add _CCC_ASSUME Deprecate thrust::numeric_limits (#3366) Replace `typedef` with `using` in libcu++ (#3368) Deprecate thrust::optional (#3307) Fixes: #3306 Upgrade to Catch2 3.8 (#3310) Fixes: #1724 refactor `<cuda/std/cstdint>` (#3325) Co-authored-by: Bernhard Manfred Gruber <[email protected]> Update CODEOWNERS (#3331) * Update CODEOWNERS * Update CODEOWNERS * Update CODEOWNERS * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix sign-compare warning (#3408) Implement more cmath functions to be usable 
on host and device (#3382) * Implement more cmath functions to be usable on host and device * Implement math roots functions * Implement exponential functions Redefine and deprecate thrust::remove_cvref (#3394) * Redefine and deprecate thrust::remove_cvref Co-authored-by: Michael Schellenberger Costa <[email protected]> Fix assert definition for NVHPC due to constexpr issues (#3418) NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it. Fix this by always using the host definition which should also work on device. Fixes #3411 Extend CUB reduce benchmarks (#3401) * Rename max.cu to custom.cu, since it uses a custom operator * Extend types covered my min.cu to all fundamental types * Add some notes on how to collect tuning parameters Fixes: #3283 Update upload-pages-artifact to v3 (#3423) * Update upload-pages-artifact to v3 * Empty commit --------- Co-authored-by: Ashwin Srinath <[email protected]> Replace and deprecate thrust::cuda_cub::terminate (#3421) `std::linalg` accessors and `transposed_layout` (#2962) Add round up/down to multiple (#3234) [FEA]: Introduce Python module with CCCL headers (#3201) * Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative * Run `copy_cccl_headers_to_aude_include()` before `setup()` * Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path. * Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel * Bug fix: cuda/_include only exists after shutil.copytree() ran. * Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py * Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions) * Replace := operator (needs Python 3.8+) * Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md * Restore original README.md: `pip3 install -e` now works on first pass. 
* cuda_cccl/README.md: FOR INTERNAL USE ONLY * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917) Command used: ci/update_version.sh 2 8 0 * Modernize pyproject.toml, setup.py Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178 * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996 * Install CCCL headers under cuda.cccl.include Trigger for this change: * https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562 Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely. * Factor out cuda_cccl/cuda/cccl/include_paths.py * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative * Add missing Copyright notice. * Add missing __init__.py (cuda.cccl) * Add `"cuda.cccl"` to `autodoc.mock_imports` * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.) * Add # TODO: move this to a module-level import * Modernize cuda_cooperative/pyproject.toml, setup.py * Convert cuda_cooperative to use hatchling as build backend. * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe. * Move numpy from [build-system] requires -> [project] dependencies * Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH * Remove copy_license() and use license_files=["../../LICENSE"] instead. 
* Further modernize cuda_cccl/setup.py to use pathlib * Trivial simplifications in cuda_cccl/pyproject.toml * Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code * Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml * Add taplo-pre-commit to .pre-commit-config.yaml * taplo-pre-commit auto-fixes * Use pathlib in cuda_cooperative/setup.py * CCCL_PYTHON_PATH in cuda_cooperative/setup.py * Modernize cuda_parallel/pyproject.toml, setup.py * Use pathlib in cuda_parallel/setup.py * Add `# TOML lint & format` comment. * Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml * Use pathlib in cuda/cccl/include_paths.py * pre-commit autoupdate (EXCEPT clang-format, which was manually restored) * Fixes after git merge main * Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result' ``` =========================================================================== warnings summary =========================================================================== tests/test_reduce.py::test_reduce_non_contiguous /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080> Traceback (most recent call last): File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__ bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result)) ^^^^^^^^^^^^^^^^^ AttributeError: '_Reduce' object has no attribute 'build_result' warnings.warn(pytest.PytestUnraisableExceptionWarning(msg)) -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ============================================================== ``` * Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class 
CustomBuildPy` * Introduce cuda_cooperative/constraints.txt * Also add cuda_parallel/constraints.txt * Add `--constraint constraints.txt` in ci/test_python.sh * Update Copyright dates * Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024) For completeness: The other repo took a long time to install into the pre-commit cache; so long it lead to timeouts in the CCCL CI. * Remove unused cuda_parallel jinja2 dependency (noticed by chance). * Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead. * Make cuda_cooperative, cuda_parallel testing completely independent. * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]" This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2. Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971 * Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc] * Restore original ci/matrix.yaml [skip-rapids] * Use for loop in test_python.sh to avoid code duplication. 
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci] * Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc] * Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]" This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d. * Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460) * Address feedback by @leofang --------- Co-authored-by: Bernhard Manfred Gruber <[email protected]> cuda.parallel: Add optional stream argument to reduce_into() (#3348) * Add optional stream argument to reduce_into() * Add tests to check for reduce_into() stream behavior * Move protocol related utils to separate file and rework __cuda_stream__ error messages * Fix synchronization issue in stream test and add one more invalid stream test case * Rename cuda stream validation function after removing leading underscore * Unpack values from __cuda_stream__ instead of indexing * Fix linting errors * Handle TypeError when unpacking invalid __cuda_stream__ return * Use stream to allocate cupy memory in new stream test Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434) Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419) * Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ Fixes #3404 Fix CI issues (#3443) update docs fix review restrict allowed types replace constexpr implementations with generic optimize `__is_arithmetic_integral`
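A note on the `'_Reduce' object has no attribute 'build_result'` warning quoted in the #3201 entry above: a traceback like that typically appears when `__init__` raises before the attribute is assigned, and `__del__` then runs on the half-constructed object. The log does not show how the PR resolved it, so the following is only a minimal sketch of the usual defensive pattern. `Resource` and its `fail` flag are hypothetical stand-ins, not the `_Reduce` class.

```python
class Resource:
    """Hypothetical stand-in for an object that owns a native build result."""

    cleaned_up = []  # records handles released by __del__, for illustration only

    def __init__(self, handle, fail=False):
        # Define the attribute before anything that can raise, so that
        # __del__ always finds it, even on a half-constructed object.
        self.build_result = None
        if fail:
            raise ValueError("validation failed before the build step")
        self.build_result = handle

    def __del__(self):
        # getattr also covers the case where __init__ never ran at all.
        build_result = getattr(self, "build_result", None)
        if build_result is None:
            return
        Resource.cleaned_up.append(build_result)
        self.build_result = None  # make repeated cleanup a no-op


try:
    Resource("h1", fail=True)
except ValueError:
    pass  # __del__ on the failed object finds build_result is None and returns
```

Either guard alone would avoid the AttributeError; using both keeps `__del__` safe regardless of where construction stops.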

@shwina


@shwina

update docs update docs add `memcmp`, `memmove` and `memchr` implementations implement tests Use cuda::std::min/max in Thrust (NVIDIA#3364) Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (NVIDIA#3361) * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` Cleanup util_arch (NVIDIA#2773) Deprecate thrust::null_type (NVIDIA#3367) Deprecate cub::DeviceSpmv (NVIDIA#3320) Fixes: NVIDIA#896 Improves `DeviceSegmentedSort` test run time for large number of items and segments (NVIDIA#3246) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * fixes spelling * adds tests for large number of segments * fixes narrowing conversion in tests * addresses review comments * fixes includes Compile basic infra test with C++17 (NVIDIA#3377) Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (NVIDIA#3308) * fixes segment offset generation * switches to analytical verification * switches to analytical verification for pairs * addresses review comments * introduces segment offset type * adds tests for large number of segments * adds support for large number of segments * drops segment offset type * fixes thrust namespace * removes about-to-be-deprecated cub iterators * no exec specifier on defaulted ctor * fixes gcc7 linker error * uses local_segment_index_t throughout * determine offset type based on type returned by segment iterator begin/end iterators * minor style improvements Exit with error when RAPIDS CI fails. 
(NVIDIA#3385) cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218) * Introduce gpu_struct decorator and typing * Enable `reduce` to accept arrays of structs as inputs * Add test for reducing arrays-of-struct * Update documentation * Use a numpy array rather than ctypes object * Change zeros -> empty for output array and temp storage * Add a TODO for typing GpuStruct * Documentation udpates * Remove test_reduce_struct_type from test_reduce.py * Revert to `to_cccl_value()` accepting ndarray + GpuStruct * Bump copyrights --------- Co-authored-by: Ashwin Srinath <[email protected]> Deprecate thrust::async (NVIDIA#3324) Fixes: NVIDIA#100 Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342) Fix broken `_CCCL_BUILTIN_ASSUME` macro (NVIDIA#3314) * add compiler-specific path * fix device code path * add _CCC_ASSUME Deprecate thrust::numeric_limits (NVIDIA#3366) Replace `typedef` with `using` in libcu++ (NVIDIA#3368) Deprecate thrust::optional (NVIDIA#3307) Fixes: NVIDIA#3306 Upgrade to Catch2 3.8 (NVIDIA#3310) Fixes: NVIDIA#1724 refactor `<cuda/std/cstdint>` (NVIDIA#3325) Co-authored-by: Bernhard Manfred Gruber <[email protected]> Update CODEOWNERS (NVIDIA#3331) * Update CODEOWNERS * Update CODEOWNERS * Update CODEOWNERS * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix sign-compare warning (NVIDIA#3408) Implement more cmath functions to be usable on host and device (NVIDIA#3382) * Implement more cmath functions to be usable on host and device * Implement math roots functions * Implement exponential functions Redefine and deprecate thrust::remove_cvref (NVIDIA#3394) * Redefine and deprecate thrust::remove_cvref Co-authored-by: Michael Schellenberger Costa <[email protected]> Fix assert definition for NVHPC due to constexpr issues (NVIDIA#3418) NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks 
it. Fix this by always using the host definition which should also work on device. Fixes NVIDIA#3411 Extend CUB reduce benchmarks (NVIDIA#3401) * Rename max.cu to custom.cu, since it uses a custom operator * Extend types covered my min.cu to all fundamental types * Add some notes on how to collect tuning parameters Fixes: NVIDIA#3283 Update upload-pages-artifact to v3 (NVIDIA#3423) * Update upload-pages-artifact to v3 * Empty commit --------- Co-authored-by: Ashwin Srinath <[email protected]> Replace and deprecate thrust::cuda_cub::terminate (NVIDIA#3421) `std::linalg` accessors and `transposed_layout` (NVIDIA#2962) Add round up/down to multiple (NVIDIA#3234) [FEA]: Introduce Python module with CCCL headers (NVIDIA#3201)
cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348) * Add optional stream argument to reduce_into() * Add tests to check for reduce_into() stream behavior * Move protocol related utils to separate file and rework __cuda_stream__ error messages * Fix synchronization issue in stream test and add one more invalid stream test case * Rename cuda stream validation function after removing leading underscore * Unpack values from __cuda_stream__ instead of indexing * Fix linting errors * Handle TypeError when unpacking invalid __cuda_stream__ return * Use stream to allocate cupy memory in new stream test Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (NVIDIA#3434) Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419) * Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ Fixes NVIDIA#3404 Fix CI issues (NVIDIA#3443) Remove deprecated `cub::min` (NVIDIA#3450) * Remove deprecated `cuda::{min,max}` * Drop unused `thrust::remove_cvref` file Fix typo in builtin (NVIDIA#3451) Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435) uses unsigned offset types in thrust's scan dispatch (NVIDIA#3436) Default transform_iterator's copy ctor (NVIDIA#3395) Fixes: NVIDIA#2393 Turn C++ dialect warning into error (NVIDIA#3453) Uses unsigned offset types in thrust's sort algorithm calling into `DispatchMergeSort` (NVIDIA#3437) * uses thrust's dynamic dispatch for merge_sort * [pre-commit.ci] auto code formatting --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Refactor allocator handling of contiguous_storage (NVIDIA#3050) Co-authored-by: Michael Schellenberger Costa <[email protected]> Drop thrust::detail::integer_traits (NVIDIA#3391) Add cuda::is_floating_point
supporting half and bfloat (NVIDIA#3379) Co-authored-by: Michael Schellenberger Costa <[email protected]> Improve docs of std headers (NVIDIA#3416) Drop C++11 and C++14 support for all of cccl (NVIDIA#3417) * Drop C++11 and C++14 support for all of cccl --------- Co-authored-by: Bernhard Manfred Gruber <[email protected]> Deprecate a few CUB macros (NVIDIA#3456) Deprecate thrust universal iterator categories (NVIDIA#3461) Fix launch args order (NVIDIA#3465) Add `--extended-lambda` to the list of removed clangd flags (NVIDIA#3432) add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429) Add `_CCCL_BUILTIN_PREFETCH` (NVIDIA#3433) Drop universal iterator categories (NVIDIA#3474) Ensure that headers in `<cuda/*>` can be build with a C++ only compiler (NVIDIA#3472) Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470) Also ensure that we actually can enable FP8 due to FP16 and BF16 requirements Co-authored-by: Michael Schellenberger Costa <[email protected]> Moves CUB kernel entry points to a detail namespace (NVIDIA#3468) * moves emptykernel to detail ns * second batch * third batch * fourth batch * fixes cuda parallel * concatenates nested namespaces Deprecate block/warp algo specializations (NVIDIA#3455) Fixes: NVIDIA#3409 Refactor CUB's util_debug (NVIDIA#3345)

@shwina

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative
* Run `copy_cccl_headers_to_aude_include()` before `setup()`
* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.
* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel
* Bug fix: cuda/_include only exists after shutil.copytree() ran.
* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py
* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)
* Replace := operator (needs Python 3.8+)
* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md
* Restore original README.md: `pip3 install -e` now works on first pass.
* cuda_cccl/README.md: FOR INTERNAL USE ONLY
* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment))
  Command used: ci/update_version.sh 2 8 0
* Modernize pyproject.toml, setup.py
  Trigger for this change:
  * NVIDIA#3201 (comment)
  * NVIDIA#3201 (comment)
* Install CCCL headers under cuda.cccl.include
  Trigger for this change:
  * NVIDIA#3201 (comment)

  Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.
* Factor out cuda_cccl/cuda/cccl/include_paths.py
* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative
* Add missing Copyright notice.
* Add missing __init__.py (cuda.cccl)
* Add `"cuda.cccl"` to `autodoc.mock_imports`
* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)
* Add # TODO: move this to a module-level import
* Modernize cuda_cooperative/pyproject.toml, setup.py
* Convert cuda_cooperative to use hatchling as build backend.
* Revert "Convert cuda_cooperative to use hatchling as build backend."
  This reverts commit 61637d6.
* Move numpy from [build-system] requires -> [project] dependencies
* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH
* Remove copy_license() and use license_files=["../../LICENSE"] instead.
* Further modernize cuda_cccl/setup.py to use pathlib
* Trivial simplifications in cuda_cccl/pyproject.toml
* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code
* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml
* Add taplo-pre-commit to .pre-commit-config.yaml
* taplo-pre-commit auto-fixes
* Use pathlib in cuda_cooperative/setup.py
* CCCL_PYTHON_PATH in cuda_cooperative/setup.py
* Modernize cuda_parallel/pyproject.toml, setup.py
* Use pathlib in cuda_parallel/setup.py
* Add `# TOML lint & format` comment.
* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml
* Use pathlib in cuda/cccl/include_paths.py
* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)
* Fixes after git merge main
* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

  ```
  =========================================================================== warnings summary ===========================================================================
  tests/test_reduce.py::test_reduce_non_contiguous
    /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

    Traceback (most recent call last):
      File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
        bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                         ^^^^^^^^^^^^^^^^^
    AttributeError: '_Reduce' object has no attribute 'build_result'

      warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

  -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
  ============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
  ```
* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`
* Introduce cuda_cooperative/constraints.txt
* Also add cuda_parallel/constraints.txt
* Add `--constraint constraints.txt` in ci/test_python.sh
* Update Copyright dates
* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)
  For completeness: The other repo took a long time to install into the pre-commit cache; so long it lead to timeouts in the CCCL CI.
* Remove unused cuda_parallel jinja2 dependency (noticed by chance).
* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.
* Make cuda_cooperative, cuda_parallel testing completely independent.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"
  This reverts commit ea33a21.
  Error message: NVIDIA#3201 (comment)
* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Restore original ci/matrix.yaml [skip-rapids]
* Use for loop in test_python.sh to avoid code duplication.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]
* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"
  This reverts commit ec206fd.
* Implement suggestion by @shwina (NVIDIA#3201 (review))
* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber
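
The commit message above mentions resolving `AttributeError: '_Reduce' object has no attribute 'build_result'`, which pytest reported from `__del__`. The message does not say how it was fixed, so the following is only a sketch of one common way to avoid this class of warning, not the change made in this PR. `Handle` and `release` are hypothetical names; the idea is to assign the attribute before anything in `__init__` can raise, and to let `__del__` tolerate a missing or empty value.

```python
released = []  # records what the fake cleanup call was given


def release(resource):
    """Hypothetical stand-in for a native cleanup call."""
    released.append(resource)


class Handle:
    """Hypothetical wrapper that owns a resource created late in __init__."""

    def __init__(self, fail=False):
        # Assign the attribute before anything that can raise,
        # so that __del__ always finds it.
        self.build_result = None
        if fail:
            raise ValueError("construction failed before the build step")
        self.build_result = object()

    def __del__(self):
        # getattr with a default also covers objects whose __init__ never ran.
        resource = getattr(self, "build_result", None)
        if resource is not None:
            release(resource)
```

With this shape, a constructor that fails early leaves nothing for `__del__` to clean up, so no exception is raised during finalization.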

@shwina

* [FEA]: Introduce Python module with CCCL headers (#3201)
* cuda.parallel: invoke pytest directly rather than via `python -m pytest` (#3523)
  Co-authored-by: Ashwin Srinath
* Copy file from PR #3547 (bugfix/drop_pipe_in_lit by @wmaxey)
* Revert "cuda.parallel: invoke pytest directly rather than via `python -m pytest` (#3523)"
  This reverts commit a2e21cb.
* Replace pipes.quote with shlex.quote in lit config (#3547)
  * Replace pipes.quote with shlex.quote
  * Drop TBB run on windows to unblock CI
  * Update ci/matrix.yaml

  Co-authored-by: Michael Schellenberger Costa
  Co-authored-by: Bernhard Manfred Gruber
* Remove nvks runners from testing pool. (#3580)

---------

Co-authored-by: Bernhard Manfred Gruber
Co-authored-by: Ashwin Srinath
Co-authored-by: Wesley Maxey
Co-authored-by: Michael Schellenberger Costa
Co-authored-by: Allison Piper
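
For context on the `pipes.quote` to `shlex.quote` replacement listed above: the `pipes` module was removed in Python 3.13, and `shlex.quote` is the standard-library function for escaping a string for use in a POSIX shell command line. The snippet below is a minimal illustration of the replacement call; `build_command` is a hypothetical helper and is not taken from the lit config.

```python
import shlex


def build_command(executable, args):
    """Join an executable and its arguments into one shell-safe string."""
    return " ".join(shlex.quote(part) for part in [executable, *args])


command = build_command("echo", ["a b", "plain"])
print(command)

# shlex.split reverses the quoting, which is a handy sanity check.
print(shlex.split(command))
```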

@shwina

…e 2.8.x branch. (NVIDIA#3536) * [FEA]: Introduce Python module with CCCL headers (NVIDIA#3201) * Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative * Run `copy_cccl_headers_to_aude_include()` before `setup()` * Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path. * Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel * Bug fix: cuda/_include only exists after shutil.copytree() ran. * Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py * Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions) * Replace := operator (needs Python 3.8+) * Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md * Restore original README.md: `pip3 install -e` now works on first pass. * cuda_cccl/README.md: FOR INTERNAL USE ONLY * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment)) Command used: ci/update_version.sh 2 8 0 * Modernize pyproject.toml, setup.py Trigger for this change: * NVIDIA#3201 (comment) * NVIDIA#3201 (comment) * Install CCCL headers under cuda.cccl.include Trigger for this change: * NVIDIA#3201 (comment) Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely. * Factor out cuda_cccl/cuda/cccl/include_paths.py * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative * Add missing Copyright notice. * Add missing __init__.py (cuda.cccl) * Add `"cuda.cccl"` to `autodoc.mock_imports` * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.) * Add # TODO: move this to a module-level import * Modernize cuda_cooperative/pyproject.toml, setup.py * Convert cuda_cooperative to use hatchling as build backend. * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d6. 
* Move numpy from [build-system] requires -> [project] dependencies
* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH
* Remove copy_license() and use license_files=["../../LICENSE"] instead.
* Further modernize cuda_cccl/setup.py to use pathlib
* Trivial simplifications in cuda_cccl/pyproject.toml
* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code
* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml
* Add taplo-pre-commit to .pre-commit-config.yaml
* taplo-pre-commit auto-fixes
* Use pathlib in cuda_cooperative/setup.py
* CCCL_PYTHON_PATH in cuda_cooperative/setup.py
* Modernize cuda_parallel/pyproject.toml, setup.py
* Use pathlib in cuda_parallel/setup.py
* Add `# TOML lint & format` comment.
* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml
* Use pathlib in cuda/cccl/include_paths.py
* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)
* Fixes after git merge main
* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`
* Introduce cuda_cooperative/constraints.txt
* Also add cuda_parallel/constraints.txt
* Add `--constraint constraints.txt` in ci/test_python.sh
* Update Copyright dates
* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)
  For completeness: The other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI.
* Remove unused cuda_parallel jinja2 dependency (noticed by chance).
* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.
* Make cuda_cooperative, cuda_parallel testing completely independent.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"
  This reverts commit ea33a21. Error message: NVIDIA#3201 (comment)
* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Restore original ci/matrix.yaml [skip-rapids]
* Use for loop in test_python.sh to avoid code duplication.
* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]
* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]
* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"
  This reverts commit ec206fd.
* Implement suggestion by @shwina (NVIDIA#3201 (review))
* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

* cuda.parallel: invoke pytest directly rather than via `python -m pytest` (NVIDIA#3523)
  Co-authored-by: Ashwin Srinath <[email protected]>
* Copy file from PR NVIDIA#3547 (bugfix/drop_pipe_in_lit by @wmaxey)
* Revert "cuda.parallel: invoke pytest directly rather than via `python -m pytest` (NVIDIA#3523)"
  This reverts commit a2e21cb.
* Replace pipes.quote with shlex.quote in lit config (NVIDIA#3547)
* Replace pipes.quote with shlex.quote
* Drop TBB run on windows to unblock CI
* Update ci/matrix.yaml
  Co-authored-by: Michael Schellenberger Costa <[email protected]>
  Co-authored-by: Bernhard Manfred Gruber <[email protected]>
* Remove nvks runners from testing pool. (NVIDIA#3580)

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>
Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: Wesley Maxey <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>
Co-authored-by: Allison Piper <[email protected]>
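For context on the `pipes.quote` -> `shlex.quote` item in the list above: the standard-library `pipes` module was deprecated in Python 3.11 and removed in 3.13, and `shlex.quote` is the documented way to quote a string for a POSIX shell. The sketch below only illustrates the replacement call; the helper name `build_command` is hypothetical and is not taken from the lit config.

```python
import shlex


def build_command(executable, args):
    """Join an executable and its arguments into one shell-safe string.

    Hypothetical helper for illustration; each part is quoted with
    shlex.quote so spaces and quotes survive a round trip through the shell.
    """
    return " ".join(shlex.quote(part) for part in [executable, *args])


# Arguments containing a space and a single quote are quoted safely.
cmd = build_command("echo", ["hello world", "it's"])
```

Splitting `cmd` again with `shlex.split` gives back the original three parts, which is a quick way to check the quoting.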

Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

daab580

rwgk mentioned this pull request Dec 19, 2024

[FEA]: Introduce Python module with CCCL headers #2281

Closed

1 task

leofang reviewed Dec 19, 2024

python/cuda_cccl/setup.py (outdated, resolved)

Run copy_cccl_headers_to_aude_include() before setup()

ef9d5f4

Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

bc116dc

rwgk added 4 commits December 19, 2024 17:18

Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

2913ae0

Bug fix: cuda/_include only exists after shutil.copytree() ran.

7dbb82b

Use f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl" in setup.py

0703901
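A minimal sketch of how such a direct-reference requirement string could be built with `pathlib`, assuming `cccl_path` is an absolute path to a CCCL checkout. The function name `cuda_cccl_requirement` is hypothetical; the actual setup.py in this PR may differ.

```python
from pathlib import Path


def cuda_cccl_requirement(cccl_path):
    """Build a "name @ file://..." direct reference to a local cuda_cccl.

    Hypothetical helper. Path.as_uri() raises ValueError for relative
    paths, so cccl_path is assumed to be absolute.
    """
    package_dir = Path(cccl_path) / "python" / "cuda_cccl"
    return f"cuda-cccl @ {package_dir.as_uri()}"


requirement = cuda_cccl_requirement("/home/coder/cccl")
```

On a POSIX system this yields the same form as the constraint line shown in the description (`cuda-cccl @ file:///home/coder/cccl/python/cuda_cccl`).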

Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

fc0e543

Replace := operator (needs Python 3.8+)

2e64345
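For readers unfamiliar with the `:=` change: an assignment expression can always be rewritten as a plain assignment followed by the test, which also parses on Python versions older than 3.8. The two functions below are hypothetical illustrations, not the code touched by this commit.

```python
def lookup_py38(env, key):
    # Python 3.8+: bind and test the value in a single expression.
    if (value := env.get(key)) is not None:
        return value.strip()
    return None


def lookup_portable(env, key):
    # Equivalent form without the := operator.
    value = env.get(key)
    if value is not None:
        return value.strip()
    return None


env = {"CCCL_PATH": " /home/coder/cccl "}
```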

rwgk marked this pull request as ready for review December 20, 2024 03:13

rwgk requested review from a team as code owners December 20, 2024 03:13

rwgk requested review from jrhemstad and miscco and removed request for miscco and jrhemstad December 20, 2024 03:13

Merge branch 'main' into pip-cuda-cccl [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

1f4d210

This was referenced Jan 16, 2025

Accident, please ignore. NVIDIA/cuda-python#405

Closed

[FEA]: Enable taplo-lint pre-commit #3426

Open

rwgk added 2 commits January 16, 2025 09:38

Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

f94bbb1

Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]" This reverts commit ec206fd.

b48f866

shwina reviewed Jan 16, 2025

python/cuda_cccl/cuda/cccl/__init__.py (resolved)

rwgk added 2 commits January 16, 2025 10:51

Implement suggestion by @shwina (NVIDIA#3201 (review))

917147f

Merge branch 'main' into pip-cuda-cccl

ebdbb22

leofang requested changes Jan 16, 2025

Address feedback by @leofang

12dbf29

rwgk marked this pull request as ready for review January 16, 2025 20:07

rwgk requested a review from a team as a code owner January 16, 2025 20:07

leofang approved these changes Jan 16, 2025

shwina approved these changes Jan 17, 2025

rwgk merged commit 3e1e6e0 into NVIDIA:main Jan 17, 2025
168 of 171 checks passed

rwgk deleted the pip-cuda-cccl branch January 17, 2025 03:52

rwgk mentioned this pull request Jan 18, 2025

[ENH]: Improve/Rewrite cuda.parallel's build system #2334

Open

rwgk mentioned this pull request Jan 18, 2025

[FEA]: Replace __del__ with weakref.finalize() in cuda/parallel/experimental/algorithms/ #3444

Open
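A rough sketch of the idea behind that follow-up issue, under stated assumptions: a `__del__` that reads `self.build_result` fails with `AttributeError` when `__init__` raised before the attribute was set, whereas a `weakref.finalize` callback registered only after the resource exists never sees a half-built object. The class `Reduce`, the `_cleanup` function and the `cleaned_up` list are stand-ins for illustration; they are not the cuda.parallel implementation.

```python
import weakref

cleaned_up = []  # stands in for the native cleanup call


def _cleanup(build_result):
    # Must not reference the owning object, or it would be kept alive.
    cleaned_up.append(build_result)


class Reduce:
    def __init__(self, fail_before_build=False):
        if fail_before_build:
            # No finalizer is registered yet, so nothing runs on teardown.
            raise RuntimeError("build failed")
        self.build_result = "handle"
        # Registered only once build_result exists.
        self._finalizer = weakref.finalize(self, _cleanup, self.build_result)


r = Reduce()
r._finalizer()  # explicit cleanup; would otherwise run when r is collected
```

A finalizer runs at most once, so calling `r._finalizer()` a second time is a no-op.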

rwgk mentioned this pull request Jan 27, 2025

Backport PRs #3201, #3523, #3547, #3580 to the 2.8.x branch. #3536

Merged

This was referenced Jan 30, 2025

[Backport 2.7]: PRs #3201, #3523, #3547, #3580 (#3536) #3600

Open

[Backport] #3572 to the 2.8.x branch. #3605

Merged

github-actions bot commented Dec 20, 2024

🟩 libcudacxx: Pass: 100%/48 | Total: 7h 29m | Avg: 9m 22s | Max: 36m 14s | Hits: 98%/9814

🟩 cub: Pass: 100%/47 | Total: 7h 03m | Avg: 9m 00s | Max: 31m 24s | Hits: 99%/3124

🟩 thrust: Pass: 100%/46 | Total: 6h 20m | Avg: 8m 16s | Max: 33m 45s | Hits: 99%/9260

🟩 cudax: Pass: 100%/26 | Total: 2h 18m | Avg: 5m 19s | Max: 31m 25s | Hits: 92%/312

🟩 cccl: Pass: 100%/6 | Total: 27m 15s | Avg: 4m 32s | Max: 4m 51s

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 04s | Avg: 5m 02s | Max: 8m 04s

🟩 python: Pass: 100%/1 | Total: 44m 12s | Avg: 44m 12s | Max: 44m 12s

🏃‍ Runner counts (total jobs: 176)

github-actions bot commented Jan 16, 2025

🟩 libcudacxx: Pass: 100%/46 | Total: 9h 41m | Avg: 12m 38s | Max: 37m 03s | Hits: 668%/12541

🟩 cub: Pass: 100%/38 | Total: 15h 18m | Avg: 24m 10s | Max: 1h 32m | Hits: 307%/3540

🟩 thrust: Pass: 100%/37 | Total: 9h 48m | Avg: 15m 53s | Max: 53m 48s | Hits: 226%/9220

🟩 cudax: Pass: 100%/20 | Total: 2h 03m | Avg: 6m 11s | Max: 20m 13s | Hits: 383%/522

🟩 cccl: Pass: 100%/4 | Total: 20m 48s | Avg: 5m 12s | Max: 5m 35s

🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 55s | Avg: 4m 57s | Max: 7m 38s

🟩 python: Pass: 100%/1 | Total: 42m 45s | Avg: 42m 45s | Max: 42m 45s

🏃‍ Runner counts (total jobs: 148)
