
Introduce cuda.cooperative overloads not requiring temporary storage #2528

Merged

Conversation

gevtushenko
Collaborator

Description

closes #2527

This PR introduces versions of the cooperative algorithms that do not require the caller to pass temporary storage. It is a quick fix for the temporary-storage alignment issues that arise when a kernel declares more than one shared memory array.
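For context, a rough sketch of what the new overloads enable. This is not taken from the diff; the module path, argument order, and the temp_storage_bytes / files attributes are assumptions based on the library's documented block-level examples (LTO linker setup via pynvjitlink is omitted):

```python
# Sketch only: contrasts the existing temporary-storage API with the new
# overloads that manage shared memory internally.
import numba
from numba import cuda
import cuda.cooperative.experimental as cudax

threads_per_block = 128

def op(a, b):
    return a if a > b else b

block_reduce = cudax.block.reduce(numba.int32, threads_per_block, op)
temp_storage_bytes = block_reduce.temp_storage_bytes

# Existing API: the caller declares the shared-memory temporary storage and
# passes it explicitly. With several such raw uint8 buffers in one kernel,
# their alignment becomes the caller's problem (see #2527).
@cuda.jit(link=block_reduce.files)
def kernel_with_temp_storage(data, result):
    temp_storage = cuda.shared.array(shape=temp_storage_bytes, dtype=numba.uint8)
    block_max = block_reduce(temp_storage, data[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        result[0] = block_max

# New overload introduced by this PR: no temporary-storage argument; the
# algorithm allocates a properly aligned shared-memory buffer itself.
@cuda.jit(link=block_reduce.files)
def kernel_without_temp_storage(data, result):
    block_max = block_reduce(data[cuda.threadIdx.x])
    if cuda.threadIdx.x == 0:
        result[0] = block_max
```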

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@gevtushenko gevtushenko requested a review from a team as a code owner October 9, 2024 23:31
@gevtushenko gevtushenko requested a review from griwes October 9, 2024 23:31
Contributor

github-actions bot commented Oct 9, 2024

🟩 CI finished in 14m 16s: Pass: 100%/1 | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
  • 🟩 pycuda: Pass: 100%/1 | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 14m 16s | Avg: 14m 16s | Max: 14m 16s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- pycuda
CCCL C Parallel Library

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- pycuda
CCCL C Parallel Library

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-v100-latest-1

@gevtushenko gevtushenko requested a review from elstehle October 10, 2024 04:20
@gevtushenko gevtushenko marked this pull request as draft October 10, 2024 15:39
@gevtushenko
Collaborator Author

A few things to consider before merging:

a) a sync is required before subsequent calls, which is not obvious, so we might need to add a sync inside the call (see the sketch after this list)
b) at warp scope, the implicit allocation would provide temporary storage for a single warp only, so we should either over-allocate temporary storage for 1024 / warp_size warps or disallow the temporary-storage-free API at warp scope
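For point (a), a minimal sketch of the hazard. It builds on the hypothetical block_reduce from the sketch under the PR description; the explicit cuda.syncthreads() is what a built-in sync inside the call would replace:

```python
from numba import cuda

# Sketch only: two back-to-back temporary-storage-free calls reuse the same
# implicitly allocated shared-memory buffer. Without a block-wide sync in
# between, the second call can start overwriting the buffer while slower
# threads are still reading results of the first call.
@cuda.jit(link=block_reduce.files)
def kernel_two_calls(data, result):
    tid = cuda.threadIdx.x
    first = block_reduce(data[tid])
    cuda.syncthreads()  # required today; the discussion above considers moving this inside the call
    second = block_reduce(data[tid] * 2)
    if tid == 0:
        result[0] = first + second
```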


copy-pr-bot bot commented Dec 5, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@gevtushenko gevtushenko marked this pull request as ready for review December 5, 2024 08:24
@gevtushenko
Collaborator Author

/ok to test

Contributor

github-actions bot commented Dec 5, 2024

🟩 CI finished in 23m 47s: Pass: 100%/1 | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
  • 🟩 python: Pass: 100%/1 | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 23m 47s | Avg: 23m 47s | Max: 23m 47s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-v100-latest-1

@gevtushenko gevtushenko merged commit bca6231 into NVIDIA:main Dec 5, 2024
20 checks passed
pciolkosz pushed a commit to pciolkosz/cccl that referenced this pull request Dec 6, 2024
…VIDIA#2528)

* Modernize pkg resource query

* Add cooperative overloads without shared memory

* Start fixing temp storage

* Incorporate template params into mangling

* Condense dict access

* Fix temporary storage indexing for sub hw warps

* Test multiple warps

* Disable alloc API for sub hw warps
andralex pushed a commit to caugonnet/cccl that referenced this pull request Dec 7, 2024
…VIDIA#2528)

* Modernize pkg resource query

* Add cooperative overloads without shared memory

* Start fixing temp storage

* Incorporate template params into mangling

* Condense dict access

* Fix temporary storage indexing for sub hw warps

* Test multiple warps

* Disable alloc API for sub hw warps
Development

Successfully merging this pull request may close these issues.

[FEA]: Introduce cuda.cooperative overloads not requiring temporary storage
3 participants