
[cuda.cooperative] Add block.load and block.store. #2693

Merged

Conversation

brycelelbach
Contributor

@brycelelbach brycelelbach commented Nov 4, 2024

Description

This PR adds cub::BlockLoad and cub::BlockStore to cuda.cooperative as block.load and block.store respectively.

Both of these algorithms take the input/output as a C++ iterator. To support them in cuda.cooperative, I added a DependentPointer templating facility based on the existing Pointer and DependentReference. This seems to work fine; however, I ran into a weird issue deep in Numba where an array-to-array conversion failed because the array type produced by Pointer had a Numba layout of 'C' instead of 'A' (which means any layout). Changing the array type produced by Pointer to 'A' seems to have done the trick. It's also notable that I once ended up at this assertion when a float64 array was being converted to a float32 array; it seems like that should have been caught and reported earlier in Numba.

I exposed the CUB load/store algorithm parameters, e.g. cub::BLOCK_LOAD_TRANSPOSE. I decided to have these parameters passed as strings in Python, rather than as a Python enum or named objects. This more closely matches the parameter-passing style of NumPy (layout='C', dtype='float32'), is less verbose, and doesn't preclude adding enums or named objects later. I chose not to give the names prefixes, and to use the same names for load and store, e.g. 'transpose'.
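To illustrate the string-parameter style, here is a minimal sketch of how user-facing strings might map onto CUB enumerator names. The mapping table and function name below are hypothetical illustrations, not the actual cuda.cooperative implementation; only the 'transpose' string and cub::BLOCK_LOAD_TRANSPOSE come from this PR.

```python
# Hypothetical sketch: unprefixed, load/store-shared parameter strings
# resolved to CUB enumerator names. Names below mirror CUB's
# BlockLoadAlgorithm enumerators; the dispatch helper itself is invented.
_LOAD_ALGORITHMS = {
    "direct": "BLOCK_LOAD_DIRECT",
    "striped": "BLOCK_LOAD_STRIPED",
    "vectorize": "BLOCK_LOAD_VECTORIZE",
    "transpose": "BLOCK_LOAD_TRANSPOSE",
    "warp_transpose": "BLOCK_LOAD_WARP_TRANSPOSE",
}

def resolve_load_algorithm(name: str) -> str:
    """Translate a user-facing string into a CUB enumerator name."""
    try:
        return _LOAD_ALGORITHMS[name]
    except KeyError:
        raise ValueError(
            f"unknown load algorithm {name!r}; "
            f"expected one of {sorted(_LOAD_ALGORITHMS)}"
        ) from None
```

The NumPy-style strings keep call sites terse (e.g. a call resembling block.load(..., algorithm='transpose')) while leaving room to add enums or named objects later.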

Currently I'm having an issue with block.store in my softmax example, but I suspect it's just a bug in my code.

Checklist

  • Implementation.
  • Add tests.
  • Add docs.
  • Confirm Pointer array layout change from 'C' to 'A'.
  • Investigate issue with block.store in softmax example.

@brycelelbach brycelelbach requested a review from a team as a code owner November 4, 2024 14:47
@brycelelbach brycelelbach requested a review from griwes November 4, 2024 14:47

copy-pr-bot bot commented Nov 4, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


def dtype(self):
-    return numba.types.Array(self.value_dtype, 1, 'C')
+    return numba.types.Array(self.value_dtype, 1, 'A')
Contributor Author

@brycelelbach brycelelbach Nov 4, 2024


I made this change because I was getting failures in array_to_array in Numba, which gets called when there's an implicit cast between array types. It's unhappy if you're casting to an array type that doesn't have the 'A' layout. The 'A' layout means the array could be in any layout, so this makes sense; you can cast a 'C' (row-major) or 'F' (column-major) array to 'A', but not vice versa. It's akin to being able to cast any object pointer to void* implicitly, while the reverse direction requires an explicit cast.


Perhaps Graham Markall can provide some insight.
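For intuition, the layout hierarchy described above can be sketched with plain NumPy; this is an illustration of the 'C'/'F'/'A' concept, not Numba code or part of this PR.

```python
# Illustration of the layout hierarchy using NumPy contiguity flags
# (the same idea underlies Numba's 'C', 'F', and 'A' array layouts):
# code written for "any layout" accepts both row-major and column-major
# arrays, while code requiring 'C' cannot accept an 'F' array in general.
import numpy as np

c_arr = np.arange(6, dtype=np.float32).reshape(2, 3)  # 'C' (row-major)
f_arr = np.asfortranarray(c_arr)                      # 'F' (column-major)

assert c_arr.flags["C_CONTIGUOUS"]
assert f_arr.flags["F_CONTIGUOUS"]

def sum_any_layout(a):
    # Works regardless of memory order, like an 'A'-layout signature.
    return float(a.sum())

# Both layouts are acceptable to "any layout" code.
assert sum_any_layout(c_arr) == sum_any_layout(f_arr) == 15.0
```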

@leofang
Member

leofang commented Nov 4, 2024

cc @rwgk @emcastillo for vis

@gevtushenko
Collaborator

/ok to test


@gevtushenko gevtushenko enabled auto-merge (squash) December 5, 2024 02:37
Contributor

github-actions bot commented Dec 5, 2024

🟩 CI finished in 24m 17s: Pass: 100%/1 | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
  • 🟩 python: Pass: 100%/1 | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    

👃 Inspect Changes

Modifications in project? / Modifications in project or dependencies?

CCCL Infrastructure: unchanged
libcu++: unchanged
CUB: unchanged
Thrust: unchanged
CUDA Experimental: unchanged
python: +/- modified
CCCL C Parallel Library: unchanged
Catch2Helper: unchanged

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-v100-latest-1

@gevtushenko gevtushenko merged commit a92db6b into NVIDIA:main Dec 5, 2024
18 checks passed
pciolkosz pushed a commit to pciolkosz/cccl that referenced this pull request Dec 6, 2024
* [cuda.cooperative] Add block.load and block.store.

---------

Co-authored-by: Georgy Evtushenko <[email protected]>
@brycelelbach brycelelbach deleted the pr/cuda.cooperative/block_load_store branch February 1, 2025 00:23