cuda.parallel: Minor perf improvements #3718

shwina · 2025-02-06T15:30:53Z

Description

This PR addresses some of the performance issues found by @oleksandr-pavlyk in #3213.

Changes introduced in this PR

Mainly, the performance improvement comes from the following:

Removing type validation between the calls to Reduce.__init__ and Reduce.__call__: while this removes several guardrails, I think it's appropriate. Higher-level APIs can hide the Reduce object from the user altogether and ensure that there is no way to pass objects of a different dtype between the calls to __init__ and __call__.
Adding fast paths for protocols.get_data_ptr and protocols.get_dtype: introspecting __cuda_array_interface__ for the data pointer and dtype is slow. Until we can figure out a faster, more general way to get that information for different array types, this PR adds a fast path that works for CuPy (and Numba) arrays specifically. Other array types (torch tensors, for example) fall back to the regular (slower) path.
Using CuPy to query the current device's compute capability: as described in cuda-python#439 ("Querying current device is slow compared to CuPy"), querying the CC with either Numba or CUDA-Python is quite slow compared to CuPy.
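The fast-path idea can be sketched as follows. This is a minimal illustration, not the code in this PR: the function bodies and the two stand-in array classes are assumptions, written so the example runs without a GPU. The only library detail relied on is that CuPy arrays expose their device pointer as `arr.data.ptr` and carry a NumPy dtype, whose `.str` attribute is the type string.

```python
from types import SimpleNamespace


def get_data_ptr(arr):
    """Return the device pointer of `arr` as an int."""
    try:
        # Fast path (assumed): CuPy-style arrays expose `arr.data.ptr`.
        return arr.data.ptr
    except AttributeError:
        # General path: build the interface dict and index into it.
        return arr.__cuda_array_interface__["data"][0]


def get_dtype_str(arr):
    """Return the array's type string, e.g. '<f4'."""
    try:
        # Fast path (assumed): a NumPy-style dtype attribute.
        return arr.dtype.str
    except AttributeError:
        return arr.__cuda_array_interface__["typestr"]


class FakeCuPyArray:
    """Host-only stand-in with CuPy-like attributes (hypothetical)."""

    def __init__(self, ptr, typestr):
        self.data = SimpleNamespace(ptr=ptr)
        self.dtype = SimpleNamespace(str=typestr)


class FakeGenericArray:
    """Host-only stand-in exposing only __cuda_array_interface__."""

    def __init__(self, ptr, typestr):
        self._ptr, self._typestr = ptr, typestr

    @property
    def __cuda_array_interface__(self):
        return {"data": (self._ptr, False), "typestr": self._typestr,
                "shape": (4,), "strides": None, "version": 3}


fast = FakeCuPyArray(0x1000, "<f4")
slow = FakeGenericArray(0x2000, "<i8")
print(get_data_ptr(fast), get_dtype_str(fast))
print(get_data_ptr(slow), get_dtype_str(slow))
```

The fallback keeps every `__cuda_array_interface__` producer working; only arrays with the cheaper attributes skip building the dictionary.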

Results

The plot below shows the performance improvement that this PR brings to reduce() versus the main branch:

I used Sasha's benchmarking scripts here to generate these results.

Alternatives

One idea that came up in a conversation with @leofang: we could consider changing the API to not accept __cuda_array_interface__ objects, and instead have the user pass in the required information (pointer, size, dtype, etc.). This allows each library/user to compute that information in the most efficient way possible rather than making it our responsibility.
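To make the alternative concrete, here is a rough sketch of what "pass in the required information" could look like. `DeviceBuffer` and `buffer_from_cai` are hypothetical names invented for this illustration; nothing here is an existing cuda.parallel API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceBuffer:
    """Explicit description of device memory (hypothetical type)."""
    ptr: int      # raw device pointer
    size: int     # number of elements
    typestr: str  # dtype as a type string, e.g. '<f8'


def buffer_from_cai(obj):
    """Adapter a caller could run once, outside the hot path."""
    cai = obj.__cuda_array_interface__
    size = 1
    for dim in cai["shape"]:
        size *= dim
    return DeviceBuffer(ptr=cai["data"][0], size=size, typestr=cai["typestr"])


class Fake2DArray:
    """Host-only stand-in for an array library object (hypothetical)."""
    __cuda_array_interface__ = {
        "data": (0x3000, False), "shape": (2, 5),
        "typestr": "<f8", "strides": None, "version": 3,
    }


buf = buffer_from_cai(Fake2DArray())
print(buf)
```

With this split, a library that already knows its pointer and size can construct the description directly and never touch `__cuda_array_interface__`.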

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

python/cuda_parallel/cuda/parallel/experimental/_utils/protocols.py

github-actions · 2025-02-06T15:38:54Z

🟥 CI finished in 5m 58s: Pass: 0%/1 | Total: 5m 58s | Avg: 5m 58s | Max: 5m 58s

🟥 python: Pass: 0%/1 | Total: 5m 58s | Avg: 5m 58s | Max: 5m 58s

🟥 cpu
  🟥 amd64              Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
🟥 ctk
  🟥 12.8               Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
🟥 cudacxx
  🟥 nvcc12.8           Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
🟥 cudacxx_family
  🟥 nvcc               Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
🟥 cxx
  🟥 GCC13              Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
🟥 cxx_family
  🟥 GCC                Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
🟥 gpu
  🟥 rtx2080            Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
🟥 jobs
  🟥 Test               Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s

👃 Inspect Changes

Modifications in project?

	Project
	CCCL Infrastructure
	libcu++
	CUB
	Thrust
	CUDA Experimental
+/-	python
	CCCL C Parallel Library
	Catch2Helper

Modifications in project or dependencies?

	Project
	CCCL Infrastructure
	libcu++
	CUB
	Thrust
	CUDA Experimental
+/-	python
	CCCL C Parallel Library
	Catch2Helper

🏃‍ Runner counts (total jobs: 1)

#	Runner
1	`linux-amd64-gpu-rtx2080-latest-1`

jrhemstad · 2025-02-06T15:40:54Z

How are we compared to cupy now?

github-actions · 2025-02-06T15:46:24Z

🟥 CI finished in 5m 55s: Pass: 0%/1 | Total: 5m 55s | Avg: 5m 55s | Max: 5m 55s

🟥 python: Pass: 0%/1 | Total: 5m 55s | Avg: 5m 55s | Max: 5m 55s

🟥 cpu
  🟥 amd64              Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
🟥 ctk
  🟥 12.8               Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
🟥 cudacxx
  🟥 nvcc12.8           Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
🟥 cudacxx_family
  🟥 nvcc               Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
🟥 cxx
  🟥 GCC13              Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
🟥 cxx_family
  🟥 GCC                Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
🟥 gpu
  🟥 rtx2080            Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
🟥 jobs
  🟥 Test               Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s

copy-pr-bot · 2025-02-06T16:02:02Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

shwina · 2025-02-06T16:03:07Z

How are we compared to cupy now?

Closer, but not quite there yet. We have ~15us of constant overhead versus CuPy's ~10us. I'll iterate on this PR until we reach parity.

leofang · 2025-02-06T18:02:07Z

Closer, but not there quite yet. We have ~15ms of constant overhead versus CuPy's ~10ms.

btw I think you meant us (microseconds), not ms (milliseconds). I feel we are pushing to the limit where Python overhead could be something to worry about.

shwina · 2025-02-06T19:51:58Z

Closer, but not there quite yet. We have ~15ms of constant overhead versus CuPy's ~10ms. I'll iterate on this PR until we reach parity

With the latest changes which rip out all the validation checks we do between the call to Reduce.__init__ and Reduce.__call__, as well as using CuPy to get the current device's compute capability, we do reach parity with CuPy:

Benchmark Results (input size, average time with first run, average time without first run):
Input size:         10 | Avg time with first run: 0.00106483 seconds | Avg time without first run: 0.00001107 seconds
Input size:        100 | Avg time with first run: 0.00001096 seconds | Avg time without first run: 0.00001095 seconds
Input size:       1000 | Avg time with first run: 0.00001093 seconds | Avg time without first run: 0.00001092 seconds
Input size:      10000 | Avg time with first run: 0.00001658 seconds | Avg time without first run: 0.00001652 seconds
Input size:     100000 | Avg time with first run: 0.00005286 seconds | Avg time without first run: 0.00005286 seconds
Input size:    1000000 | Avg time with first run: 0.00021406 seconds | Avg time without first run: 0.00020699 seconds
Input size:   10000000 | Avg time with first run: 0.00105273 seconds | Avg time without first run: 0.00105112 seconds
Input size:  100000000 | Avg time with first run: 0.01051427 seconds | Avg time without first run: 0.01051234 seconds
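For readers who want to see how the two averages in the table above relate, here is a small sketch. It does not reproduce the benchmarking scripts used for these results; the `benchmark` helper and its injectable `timer` parameter are assumptions for illustration. It also converts seconds to microseconds: 0.00001107 seconds is about 11 us, which matches the ~10us constant overhead discussed earlier.

```python
import time


def benchmark(fn, n_runs, timer=time.perf_counter):
    """Time `fn` n_runs times.

    Returns (average including the first run, average excluding it).
    The first run often carries one-time costs, so the two can differ a lot.
    """
    if n_runs < 2:
        raise ValueError("need at least two runs to exclude the first")
    samples = []
    for _ in range(n_runs):
        start = timer()
        fn()
        samples.append(timer() - start)
    avg_all = sum(samples) / len(samples)
    avg_warm = sum(samples[1:]) / (len(samples) - 1)
    return avg_all, avg_warm


def to_microseconds(seconds):
    """Convert a duration in seconds to microseconds."""
    return seconds * 1e6


avg_all, avg_warm = benchmark(lambda: sum(range(1000)), n_runs=50)
print(f"with first run: {to_microseconds(avg_all):.2f} us, "
      f"without first run: {to_microseconds(avg_warm):.2f} us")
```

When timing GPU work this way, the timed function would also need to wait for the device to finish before the second timer read; otherwise only the launch cost is measured.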

I feel we are pushing to the limit where Python overhead could be something to worry about.

We are absolutely there already - this PR is trying to minimize the number of Python operations we're doing in the __call__ method.
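The trade-off can be illustrated with a toy build-once, call-many object. `SumReducer` is a hypothetical stand-in, not the real `Reduce` class: it validates at construction time and does no re-validation in `__call__`, which keeps the per-call Python work small at the cost of trusting the caller.

```python
class SumReducer:
    """Toy stand-in for a build-once, call-many object (hypothetical)."""

    SUPPORTED = ("<i4", "<i8", "<f4", "<f8")

    def __init__(self, typestr):
        # All validation happens here, once, at build time.
        if typestr not in self.SUPPORTED:
            raise TypeError(f"unsupported dtype: {typestr}")
        self.typestr = typestr

    def __call__(self, values):
        # Hot path: no dtype re-validation. The caller is trusted to pass
        # data that matches the dtype given at build time.
        return sum(values)


reducer = SumReducer("<i8")
total = reducer([1, 2, 3])
print(total)
```

A higher-level wrapper that creates the object and calls it in one step can guarantee that the two dtypes agree, which is the argument made in the PR description for dropping the checks.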

shwina · 2025-02-06T19:59:46Z

/ok to test

github-actions · 2025-02-06T20:19:42Z

🟥 CI finished in 6m 06s: Pass: 0%/1 | Total: 6m 06s | Avg: 6m 06s | Max: 6m 06s

🟥 python: Pass: 0%/1 | Total: 6m 06s | Avg: 6m 06s | Max: 6m 06s

🟥 cpu
  🟥 amd64              Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
🟥 ctk
  🟥 12.8               Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
🟥 cudacxx
  🟥 nvcc12.8           Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
🟥 cudacxx_family
  🟥 nvcc               Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
🟥 cxx
  🟥 GCC13              Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
🟥 cxx_family
  🟥 GCC                Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
🟥 gpu
  🟥 rtx2080            Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
🟥 jobs
  🟥 Test               Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s

shwina · 2025-02-06T20:41:01Z

/ok to test

github-actions · 2025-02-06T20:49:53Z

🟥 CI finished in 6m 05s: Pass: 0%/1 | Total: 6m 05s | Avg: 6m 05s | Max: 6m 05s

🟥 python: Pass: 0%/1 | Total: 6m 05s | Avg: 6m 05s | Max: 6m 05s

🟥 cpu
  🟥 amd64              Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
🟥 ctk
  🟥 12.8               Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
🟥 cudacxx
  🟥 nvcc12.8           Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
🟥 cudacxx_family
  🟥 nvcc               Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
🟥 cxx
  🟥 GCC13              Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
🟥 cxx_family
  🟥 GCC                Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
🟥 gpu
  🟥 rtx2080            Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
🟥 jobs
  🟥 Test               Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s

shwina · 2025-02-06T21:02:06Z

/ok to test

github-actions · 2025-02-06T21:41:54Z

🟩 CI finished in 33m 23s: Pass: 100%/1 | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s

🟩 python: Pass: 100%/1 | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
🟩 ctk
  🟩 12.8               Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
🟩 cudacxx
  🟩 nvcc12.8           Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
🟩 gpu
  🟩 rtx2080            Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s

shwina · 2025-02-07T18:04:29Z

/ok to test

github-actions · 2025-02-07T19:09:41Z

🟩 CI finished in 28m 40s: Pass: 100%/1 | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s

🟩 python: Pass: 100%/1 | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
🟩 ctk
  🟩 12.8               Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
🟩 cudacxx
  🟩 nvcc12.8           Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
🟩 gpu
  🟩 rtx2080            Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s

leofang · 2025-02-10T14:31:58Z

Removing type validation between the calls to Reduce.__init__ and Reduce.__call__: while this removes several guardrails, I think it's appropriate. Higher level APIs can hide the Reduce object from the user altogether and ensure that there is no way to pass objects of different dtype between the calls to __init__ and __call__.

In the near future we should consider establishing an API contract for plan building and plan execution (#2429 (comment)).

we could consider changing the API to not accept __cuda_array_interface__ objects, and instead have the user pass in the required information (pointer, size, dtype, etc.,). This allows each library/user to compute that information in the most efficient way possible rather than making it our responsibility.

Let's have a separate issue to track this. Thinking about this more, we should try to make the current (low-level) interface look more like a 1:1 binding to the bare C++ one. This is what we do for cuda.cooperative too. A Pythonic interface can come later.

python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py

github-actions · 2025-02-10T15:51:58Z

🟥 CI finished in 5m 44s: Pass: 0%/1 | Total: 5m 44s | Avg: 5m 44s | Max: 5m 44s

🟥 python: Pass: 0%/1 | Total: 5m 44s | Avg: 5m 44s | Max: 5m 44s

🟥 cpu
  🟥 amd64              Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
🟥 ctk
  🟥 12.8               Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
🟥 cudacxx
  🟥 nvcc12.8           Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
🟥 cudacxx_family
  🟥 nvcc               Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
🟥 cxx
  🟥 GCC13              Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
🟥 cxx_family
  🟥 GCC                Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
🟥 gpu
  🟥 rtx2080            Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
🟥 jobs
  🟥 Test               Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s

github-actions · 2025-02-10T16:59:53Z

🟩 CI finished in 29m 45s: Pass: 100%/1 | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s

🟩 python: Pass: 100%/1 | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
🟩 ctk
  🟩 12.8               Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
🟩 cudacxx
  🟩 nvcc12.8           Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
🟩 gpu
  🟩 rtx2080            Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s

oleksandr-pavlyk · 2025-02-10T20:02:20Z

In the function is_contiguous we do

def is_contiguous(arr: DeviceArrayLike) -> bool:
    shape, strides = get_shape(arr), get_strides(arr)

    if strides is None:
        return True

    if any(dim == 0 for dim in shape):
        # array has no elements
        return True
   [---SNIPPED--]

but we do not use shape if strides is None. So I propose to speed up that case by fetching shape only after the strides check:

def is_contiguous(arr: DeviceArrayLike) -> bool:
    strides = get_strides(arr)

    if strides is None:
        return True

    shape = get_shape(arr)
    if any(dim == 0 for dim in shape):
        # array has no elements
        return True
   [---SNIPPED--]
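A self-contained variant of the same reordering is sketched below. It works directly on a `__cuda_array_interface__`-style dictionary instead of the `get_shape`/`get_strides` helpers, and everything after the zero-size check is an assumed, generic C-contiguity test, since the remainder of the original function is snipped above.

```python
def is_c_contiguous(cai):
    """Check C-contiguity from a __cuda_array_interface__-style dict."""
    strides = cai.get("strides")
    if strides is None:
        # The interface uses None to mean C-contiguous; shape is not needed.
        return True

    shape = cai["shape"]  # fetched only when strides are present
    if any(dim == 0 for dim in shape):
        return True  # array has no elements

    # Assumed remainder: walk dimensions from last to first and compare
    # each byte stride with the stride a packed layout would have.
    itemsize = int(cai["typestr"][2:])  # e.g. '<f4' -> 4 bytes
    expected = itemsize
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True


packed = {"shape": (2, 3), "strides": (12, 4), "typestr": "<f4"}
transposed = {"shape": (2, 3), "strides": (4, 8), "typestr": "<f4"}
print(is_c_contiguous(packed), is_c_contiguous(transposed))
```

The early return for `strides is None` is the common case for freshly allocated arrays, which is why skipping the shape lookup there is worthwhile.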

python/cuda_parallel/cuda/parallel/experimental/_utils/protocols.py

python/cuda_parallel/tests/test_reduce.py

python/cuda_parallel/tests/test_reduce_api.py

github-actions · 2025-02-10T21:34:57Z

🟩 CI finished in 34m 26s: Pass: 100%/1 | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s

🟩 python: Pass: 100%/1 | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s

🟩 cpu
  🟩 amd64              Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
🟩 ctk
  🟩 12.8               Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
🟩 cudacxx
  🟩 nvcc12.8           Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
🟩 cudacxx_family
  🟩 nvcc               Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
🟩 cxx
  🟩 GCC13              Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
🟩 cxx_family
  🟩 GCC                Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
🟩 gpu
  🟩 rtx2080            Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
🟩 jobs
  🟩 Test               Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s

oleksandr-pavlyk

Looks good to me @shwina

shwina requested a review from a team as a code owner February 6, 2025 15:30

shwina requested a review from gevtushenko February 6, 2025 15:30

oleksandr-pavlyk reviewed Feb 6, 2025

python/cuda_parallel/cuda/parallel/experimental/_utils/protocols.py Outdated

shwina force-pushed the cuda-parallel-minor-perf-improvements branch from bf5b043 to 4dcfc6f Compare February 6, 2025 15:39

shwina marked this pull request as draft February 6, 2025 16:02

shwina force-pushed the cuda-parallel-minor-perf-improvements branch from 0f404e7 to 41f652d Compare February 7, 2025 16:38

shwina force-pushed the cuda-parallel-minor-perf-improvements branch from 8bac88d to 2d2af2c Compare February 10, 2025 12:16

Add get_data_pointer utility and fast paths for CuPy

525a2ed

shwina force-pushed the cuda-parallel-minor-perf-improvements branch from ed71555 to 5d946c1 Compare February 10, 2025 12:41

shwina added 4 commits February 10, 2025 07:43

Remove type validations between calls to __init__ and __call__

cb2b945

Require array size

9e09571

Use CuPy to get the compute capability

08d0643

Struct scalars support __array_interface__

0429181

shwina force-pushed the cuda-parallel-minor-perf-improvements branch from 5d946c1 to 0429181 Compare February 10, 2025 12:43

shwina marked this pull request as ready for review February 10, 2025 14:15

shwina requested a review from a team as a code owner February 10, 2025 14:15

shwina requested a review from alliepiper February 10, 2025 14:15

leofang reviewed Feb 10, 2025

python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py

leofang approved these changes Feb 10, 2025

shwina force-pushed the cuda-parallel-minor-perf-improvements branch from 645d11c to 0429181 Compare February 10, 2025 16:13

oleksandr-pavlyk reviewed Feb 10, 2025

python/cuda_parallel/cuda/parallel/experimental/_utils/protocols.py Outdated

oleksandr-pavlyk reviewed Feb 10, 2025

python/cuda_parallel/tests/test_reduce.py Outdated

oleksandr-pavlyk reviewed Feb 10, 2025

python/cuda_parallel/tests/test_reduce.py Outdated

oleksandr-pavlyk reviewed Feb 10, 2025

python/cuda_parallel/tests/test_reduce_api.py

Address review feedback

b8474a4

shwina requested a review from oleksandr-pavlyk February 10, 2025 22:02

oleksandr-pavlyk approved these changes Feb 11, 2025

leofang merged commit a03ce7b into NVIDIA:main Feb 11, 2025
20 of 23 checks passed

shwina mentioned this pull request Feb 12, 2025

Investigate performance delta between cuda.parallel and CuPy reduction #3213

Closed

leofang mentioned this pull request Feb 14, 2025

Discussion: Implement "lower-level" APIs for cuda.parallel that do not accept array inputs? #3812

Open

rwgk mentioned this pull request Feb 14, 2025

Add Python wrappers for c.parallel merge_sort API #3763

Merged

2 tasks
