
Introduce cuda.cooperative overloads not requiring temporary storage #2528

Merged

Conversation

gevtushenko
Collaborator

Description

closes #2527

This PR introduces versions of the cooperative algorithms that do not require the caller to pass temporary storage. It is a quick fix for the temporary-storage alignment issues that arise when a kernel declares more than one shared memory array.
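For context, a rough sketch of what the new overloads enable. This is not taken from the diff; the module path, argument order, and the temp_storage_bytes / files attributes are assumptions based on the library's documented block-level examples (LTO linker setup via pynvjitlink is omitted):

```python
# Sketch only: contrasts the existing temporary-storage API with the new
# overloads that manage shared memory internally.
import numba
from numba import cuda
import cuda.cooperative.experimental as cudax

threads_per_block = 128

def op(a, b):
    return a if a > b else b

block_reduce = cudax.block.reduce(numba.int32, threads_per_block, op)
temp_storage_bytes = block_reduce.temp_storage_bytes

# Existing API: the caller declares the shared-memory temporary storage and
# passes it explicitly. With several such raw uint8 buffers in one kernel,
# their alignment becomes the caller's problem (see #2527).
@cuda.jit(link=block_reduce.files)
def kernel_with_temp_storage(data, result):
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)
    block_max = block_reduce(temp_storage, data[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        result[0] = block_max

# New overload introduced by this PR: no temporary-storage argument; the
# algorithm allocates a properly aligned shared-memory buffer itself.
@cuda.jit(link=block_reduce.files)
def kernel_without_temp_storage(data, result):
    block_max = block_reduce(data[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        result[0] = block_max
```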

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@gevtushenko gevtushenko requested a review from a team as a code owner October 9, 2024 23:31
@gevtushenko gevtushenko requested a review from griwes October 9, 2024 23:31
Contributor

github-actions bot commented Oct 9, 2024

🟩 CI finished in 14m 16s: Pass: 100%/1 | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
  • 🟩 pycuda: Pass: 100%/1 | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- pycuda
CCCL C Parallel Library

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- pycuda
CCCL C Parallel Library

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-v100-latest-1

@gevtushenko gevtushenko requested a review from elstehle October 10, 2024 04:20
@gevtushenko gevtushenko marked this pull request as draft October 10, 2024 15:39
@gevtushenko
Collaborator Author

A few things to consider before merging:

a) a sync is required before subsequent calls, which is not obvious, so we might need to add a sync inside the call (see the sketch after this list)
b) at warp scope, the implicit allocation would provide temporary storage for a single warp only, so we should either over-allocate temporary storage for 1024 / warp_size warps or disallow the temporary-storage-free API at warp scope
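For point (a), a minimal sketch of the hazard. It builds on the hypothetical block_reduce from the sketch under the PR description; the explicit cuda.syncthreads() is what a built-in sync inside the call would replace:

```python
from numba import cuda

# Sketch only: two back-to-back temporary-storage-free calls reuse the same
# implicitly allocated shared-memory buffer. Without a block-wide sync in
# between, the second call can start overwriting the buffer while slower
# threads are still reading results of the first call.
@cuda.jit(link=block_reduce.files)
def kernel_two_calls(data, result):
    tid = cuda.threadIdx.x
    first = block_reduce(data[tid])
    cuda.syncthreads()  # required today; the discussion above considers moving this inside the call
    second = block_reduce(data[tid] * 2)
    if tid == 0:
        result[0] = first + second
```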


copy-pr-bot bot commented Dec 5, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@gevtushenko gevtushenko marked this pull request as ready for review December 5, 2024 08:24
@gevtushenko
Collaborator Author

/ok to test

Contributor

github-actions bot commented Dec 5, 2024

🟩 CI finished in 23m 47s: Pass: 100%/1 | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
  • 🟩 python: Pass: 100%/1 | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-v100-latest-1

@gevtushenko gevtushenko merged commit bca6231 into NVIDIA:main Dec 5, 2024
20 checks passed
pciolkosz pushed a commit to pciolkosz/cccl that referenced this pull request Dec 6, 2024
…VIDIA#2528)

* Modernize pkg resource query

* Add cooperative overloads without shared memory

* Start fixing temp storage

* Incorporate template params into mangling

* Condense dict access

* Fix temporary storage indexing for sub hw warps

* Test multiple warps

* Disable alloc API for sub hw warps
andralex pushed a commit to caugonnet/cccl that referenced this pull request Dec 7, 2024
…VIDIA#2528)

* Modernize pkg resource query

* Add cooperative overloads without shared memory

* Start fixing temp storage

* Incorporate template params into mangling

* Condense dict access

* Fix temporary storage indexing for sub hw warps

* Test multiple warps

* Disable alloc API for sub hw warps
Development

Successfully merging this pull request may close these issues.

[FEA]: Introduce cuda.cooperative overloads not requiring temporary storage
3 participants