
[cuda.cooperative] Add block.load and block.store. #2693

Merged

Conversation

brycelelbach
Contributor

@brycelelbach brycelelbach commented Nov 4, 2024

Description

This PR adds cub::BlockLoad and cub::BlockStore to cuda.cooperative as block.load and block.store respectively.

Both of these algorithms take the input/output as a C++ iterator. To support them in cuda.cooperative, I added a DependentPointer templating facility based on the existing Pointer and DependentReference. This seems to work fine; however, I ran into a weird issue deep in Numba where an array-to-array conversion failed because the array type produced by Pointer had a Numba layout of 'C' instead of 'A' (which means any layout). Changing the array type produced by Pointer to 'A' seems to have done the trick. It's also notable that I once ended up at this assertion when a float64 array was being converted to a float32 array; it seems like that should have been caught and reported earlier in Numba.

I exposed the CUB load/store algorithm parameters, e.g. cub::BLOCK_LOAD_TRANSPOSE. I decided to have these parameters passed as strings in Python, rather than as a Python enum or named objects. This more closely matches the parameter-passing style of NumPy (layout='C', dtype='float32'), is less verbose, and doesn't preclude adding enums or named objects later. I chose not to give the names prefixes, and to use the same names for load and store, e.g. 'transpose'.
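To illustrate the string-parameter style, here is a minimal sketch of how user-facing strings might map onto CUB enumerator names. The mapping table and function name below are hypothetical illustrations, not the actual cuda.cooperative implementation; only the 'transpose' string and cub::BLOCK_LOAD_TRANSPOSE come from this PR.

```python
# Hypothetical sketch: unprefixed, load/store-shared parameter strings
# resolved to CUB enumerator names. Names below mirror CUB's
# BlockLoadAlgorithm enumerators; the dispatch helper itself is invented.
_LOAD_ALGORITHMS = {
    "direct": "BLOCK_LOAD_DIRECT",
    "striped": "BLOCK_LOAD_STRIPED",
    "vectorize": "BLOCK_LOAD_VECTORIZE",
    "transpose": "BLOCK_LOAD_TRANSPOSE",
    "warp_transpose": "BLOCK_LOAD_WARP_TRANSPOSE",
}

def resolve_load_algorithm(name: str) -> str:
    """Translate a user-facing string into a CUB enumerator name."""
    try:
        return _LOAD_ALGORITHMS[name]
    except KeyError:
        raise ValueError(
            f"unknown load algorithm {name!r}; "
            f"expected one of {sorted(_LOAD_ALGORITHMS)}"
        ) from None
```

The NumPy-style strings keep call sites terse (e.g. a call resembling block.load(..., algorithm='transpose')) while leaving room to add enums or named objects later.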

Currently I'm having an issue with block.store in my softmax example, but I suspect it's just a bug in my code.

Checklist

  • Implementation.
  • Add tests.
  • Add docs.
  • Confirm Pointer array layout change from 'C' to 'A'.
  • Investigate issue with block.store in softmax example.

@brycelelbach brycelelbach requested a review from a team as a code owner November 4, 2024 14:47
@brycelelbach brycelelbach requested a review from griwes November 4, 2024 14:47

copy-pr-bot bot commented Nov 4, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


def dtype(self):
-    return numba.types.Array(self.value_dtype, 1, 'C')
+    return numba.types.Array(self.value_dtype, 1, 'A')
Contributor Author

@brycelelbach brycelelbach Nov 4, 2024


I made this change because I was getting failures in array_to_array in Numba, which gets called when there's an implicit cast between array types. It's unhappy if you're casting to an array type that doesn't have the 'A' layout. The 'A' layout means the array could be in any layout, so this makes sense; you can cast a 'C' (row-major) or 'F' (column-major) array to 'A', but not vice versa. It's akin to being able to cast any object pointer to void* implicitly, while the reverse direction requires an explicit cast.


Perhaps Graham Markall can provide some insight.
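For intuition, the layout hierarchy described above can be sketched with plain NumPy; this is an illustration of the 'C'/'F'/'A' concept, not Numba code or part of this PR.

```python
# Illustration of the layout hierarchy using NumPy contiguity flags
# (the same idea underlies Numba's 'C', 'F', and 'A' array layouts):
# code written for "any layout" accepts both row-major and column-major
# arrays, while code requiring 'C' cannot accept an 'F' array in general.
import numpy as np

c_arr = np.arange(6, dtype=np.float32).reshape(2, 3)  # 'C' (row-major)
f_arr = np.asfortranarray(c_arr)                      # 'F' (column-major)

assert c_arr.flags["C_CONTIGUOUS"]
assert f_arr.flags["F_CONTIGUOUS"]

def sum_any_layout(a):
    # Works regardless of memory order, like an 'A'-layout signature.
    return float(a.sum())

# Both layouts are acceptable to "any layout" code.
assert sum_any_layout(c_arr) == sum_any_layout(f_arr) == 15.0
```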

@leofang
Member

leofang commented Nov 4, 2024

cc @rwgk @emcastillo for vis

@gevtushenko
Collaborator

/ok to test


@gevtushenko gevtushenko enabled auto-merge (squash) December 5, 2024 02:37
Contributor

github-actions bot commented Dec 5, 2024

🟩 CI finished in 24m 17s: Pass: 100%/1 | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
  • 🟩 python: Pass: 100%/1 | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 24m 17s | Avg: 24m 17s | Max: 24m 17s
    

👃 Inspect Changes

Modifications in project? / Modifications in project or dependencies?

CCCL Infrastructure: unchanged
libcu++: unchanged
CUB: unchanged
Thrust: unchanged
CUDA Experimental: unchanged
python: +/- modified
CCCL C Parallel Library: unchanged
Catch2Helper: unchanged

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-v100-latest-1

@gevtushenko gevtushenko merged commit a92db6b into NVIDIA:main Dec 5, 2024
18 checks passed
pciolkosz pushed a commit to pciolkosz/cccl that referenced this pull request Dec 6, 2024
* [cuda.cooperative] Add block.load and block.store.

---------

Co-authored-by: Georgy Evtushenko <[email protected]>
@brycelelbach brycelelbach deleted the pr/cuda.cooperative/block_load_store branch February 1, 2025 00:23