Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cuda.parallel: Minor perf improvements #3718

Merged
merged 6 commits into from
Feb 11, 2025

Conversation

shwina
Copy link
Contributor

@shwina shwina commented Feb 6, 2025

Description

This PR addresses some of the performance issues found by @oleksandr-pavlyk' in #3213.

Changes introduced in this PR

Mainly, the performance improvement comes from the following:

  1. Removing type validation between the calls to Reduce.__init__ and Reduce.__call__: while this removes several guardrails, I think it's appropriate. Higher level APIs can hide the Reduce object from the user altogether and ensure that there is no way to pass objects of different dtype between the calls to __init__ and __call__.

  2. Adding fast paths for protocols.get_data_ptr and protocols.get_dtype: introspecting __cuda_array_interface__ for the data pointer and dtype is slow. Until we can figure out a faster, more general way to get that information for different array types, this PR adds a fast path that works for CuPy (and Numba) arrays specifically. For other array types (like torch tensors for example), it will fall back to the regular (slower) path.

  3. Using CuPy to query the current device's compute capability: as described in Querying current device is slow compared to CuPy cuda-python#439, querying the CC is quite slow (using both Numba and CUDA-Python), compared to CuPy.

Results

The plot below shows the performance improvement that this PR brings to reduce() v/s the main branch:

plot

I used Sasha's benchmarking scripts here to generate these results.

Alternatives

One idea that came up in a conversation with @leofang: we could consider changing the API to not accept __cuda_array_interface__ objects, and instead have the user pass in the required information (pointer, size, dtype, etc.,). This allows each library/user to compute that information in the most efficient way possible rather than making it our responsibility.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina shwina requested a review from a team as a code owner February 6, 2025 15:30
@shwina shwina requested a review from gevtushenko February 6, 2025 15:30
Comment on lines 25 to 39
try:
return np.dtype(arr.dtype) # type: ignore
except Exception:
typestr = arr.__cuda_array_interface__["typestr"]

if typestr.startswith("|V"):
# it's a structured dtype, use the descr field:
return np.dtype(arr.__cuda_array_interface__["descr"])
else:
# a simple dtype, use the typestr field:
return np.dtype(typestr)
if typestr.startswith("|V"):
# it's a structured dtype, use the descr field:
return np.dtype(arr.__cuda_array_interface__["descr"])
else:
# a simple dtype, use the typestr field:
return np.dtype(typestr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to arr.__cuda_array_interface__ was expensive. So it pays to save the interface into a temporary and reuse it in the function scope.

With this approach, we still make repeated calls to arr.__cuda_array_interface__ (in every one of these functions). We should instead create a calls that encapsulate arr in it, and provides cached __cuda_array_interface__ property. This is likely what GpuMemoryView class is meant to be. @leofang for comments

Copy link
Contributor Author

@shwina shwina Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this PR simply adds a fast path for cupy arrays (and numba as well). If something else is passed (e.g., a torch tensor, we fall back to using __cuda_array_interface__).

I found that StridedMemoryView is expensive enough to construct that it doesn't give us sufficient speedup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that StridedMemoryView is expensive enough to construct that it doesn't give us sufficient speedup.

How was the timing done? I would like to know better on this.

Copy link
Member

@leofang leofang Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think the current implementation of cuda/parallel/experimental/_utils/protocols.py should be replaced by StridedMemoryView, because what happens now is the call to __cuda_array_interface__ is not cached, so we repeatedly call it every time to just get one attribute, instead of caching and reusing it for all attribute queries. But once we cache it essentially it is a re-implementation (of some sort) of StridedMemoryView. I don't understand what is going on and would like to get down to the bottom of it. StridedMemoryView is designed so that it supports all single-CPU/GPU array libraries, beyond just CuPy, a capability that we eventually will need for all CUDA projects.

Copy link
Contributor Author

@shwina shwina Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was the timing done? I would like to know better on this.

Consider 9cc2902, where I am converting the user-provided arrays to strided memory views and grabbing the data ptr from that. I even deleted all the contiguity checks and other validation code. The benchmarks numbers are then:

Benchmark Results (input size, average time with first run, average time without first run):
Input size:         10 | Avg time with first run: 0.00102104 seconds | Avg time without first run: 0.00002196 seconds
Input size:        100 | Avg time with first run: 0.00002170 seconds | Avg time without first run: 0.00002169 seconds
Input size:       1000 | Avg time with first run: 0.00002289 seconds | Avg time without first run: 0.00002302 seconds
Input size:      10000 | Avg time with first run: 0.00002413 seconds | Avg time without first run: 0.00002405 seconds
Input size:     100000 | Avg time with first run: 0.00004833 seconds | Avg time without first run: 0.00004833 seconds
Input size:    1000000 | Avg time with first run: 0.00019740 seconds | Avg time without first run: 0.00019125 seconds
Input size:   10000000 | Avg time with first run: 0.00104487 seconds | Avg time without first run: 0.00104453 seconds
Input size:  100000000 | Avg time with first run: 0.01038157 seconds | Avg time without first run: 0.01038252 seconds

Which is still a significant improvement over the existing benchmark numbers, but not quite good enough. The overhead here is all in the construction of the StridedMemoryViews from d_in and d_out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Ashwin, this helps. Would you mind passing stream_ptr=-1 to StridedMemoryView to bypass the stream ordering, and then point me to where bench.py is so that I can also try this out myself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scripts bench.py and device_reduce.py can be found here: #3213 (comment)

Would you mind passing stream_ptr=-1 to StridedMemoryView

This improved things very slightly:

Benchmark Results (input size, average time with first run, average time without first run):
Input size:         10 | Avg time with first run: 0.00103634 seconds | Avg time without first run: 0.00002310 seconds
Input size:        100 | Avg time with first run: 0.00002255 seconds | Avg time without first run: 0.00002250 seconds
Input size:       1000 | Avg time with first run: 0.00002412 seconds | Avg time without first run: 0.00002272 seconds
Input size:      10000 | Avg time with first run: 0.00002496 seconds | Avg time without first run: 0.00002491 seconds
Input size:     100000 | Avg time with first run: 0.00005463 seconds | Avg time without first run: 0.00005463 seconds
Input size:    1000000 | Avg time with first run: 0.00018084 seconds | Avg time without first run: 0.00016899 seconds
Input size:   10000000 | Avg time with first run: 0.00106087 seconds | Avg time without first run: 0.00104471 seconds
Input size:  100000000 | Avg time with first run: 0.01038340 seconds | Avg time without first run: 0.01038483 seconds

Copy link
Contributor

github-actions bot commented Feb 6, 2025

🟥 CI finished in 5m 58s: Pass: 0%/1 | Total: 5m 58s | Avg: 5m 58s | Max: 5m 58s
  • 🟥 python: Pass: 0%/1 | Total: 5m 58s | Avg: 5m 58s | Max: 5m 58s

    🟥 cpu
      🟥 amd64              Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
    🟥 ctk
      🟥 12.8               Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
    🟥 cudacxx
      🟥 nvcc12.8           Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
    🟥 cudacxx_family
      🟥 nvcc               Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
    🟥 cxx
      🟥 GCC13              Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
    🟥 cxx_family
      🟥 GCC                Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
    🟥 gpu
      🟥 rtx2080            Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
    🟥 jobs
      🟥 Test               Pass:   0%/1   | Total:  5m 58s | Avg:  5m 58s | Max:  5m 58s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@shwina shwina force-pushed the cuda-parallel-minor-perf-improvements branch from bf5b043 to 4dcfc6f Compare February 6, 2025 15:39
@jrhemstad
Copy link
Collaborator

How are we compared to cupy now?

Copy link
Contributor

github-actions bot commented Feb 6, 2025

🟥 CI finished in 5m 55s: Pass: 0%/1 | Total: 5m 55s | Avg: 5m 55s | Max: 5m 55s
  • 🟥 python: Pass: 0%/1 | Total: 5m 55s | Avg: 5m 55s | Max: 5m 55s

    🟥 cpu
      🟥 amd64              Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
    🟥 ctk
      🟥 12.8               Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
    🟥 cudacxx
      🟥 nvcc12.8           Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
    🟥 cudacxx_family
      🟥 nvcc               Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
    🟥 cxx
      🟥 GCC13              Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
    🟥 cxx_family
      🟥 GCC                Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
    🟥 gpu
      🟥 rtx2080            Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
    🟥 jobs
      🟥 Test               Pass:   0%/1   | Total:  5m 55s | Avg:  5m 55s | Max:  5m 55s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@shwina shwina marked this pull request as draft February 6, 2025 16:02
Copy link

copy-pr-bot bot commented Feb 6, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@shwina
Copy link
Contributor Author

shwina commented Feb 6, 2025

How are we compared to cupy now?

Closer, but not there quite yet. We have ~15us of constant overhead versus CuPy's ~10us. I'll iterate on this PR until we reach parity

@leofang
Copy link
Member

leofang commented Feb 6, 2025

Closer, but not there quite yet. We have ~15ms of constant overhead versus CuPy's ~10ms.

btw I think you meant us (microseconds) not ms (millisecond). I feel we are pushing to the limit where Python overhead could be something to worry about.

@shwina
Copy link
Contributor Author

shwina commented Feb 6, 2025

Closer, but not there quite yet. We have ~15ms of constant overhead versus CuPy's ~10ms. I'll iterate on this PR until we reach parity

With the latest changes which rip out all the validation checks we do between the call to Reduce.__init__ and Reduce.__call__, as well as using CuPy to get the current device's compute capability, we do reach parity with CuPy:

Benchmark Results (input size, average time with first run, average time without first run):
Input size:         10 | Avg time with first run: 0.00106483 seconds | Avg time without first run: 0.00001107 seconds
Input size:        100 | Avg time with first run: 0.00001096 seconds | Avg time without first run: 0.00001095 seconds
Input size:       1000 | Avg time with first run: 0.00001093 seconds | Avg time without first run: 0.00001092 seconds
Input size:      10000 | Avg time with first run: 0.00001658 seconds | Avg time without first run: 0.00001652 seconds
Input size:     100000 | Avg time with first run: 0.00005286 seconds | Avg time without first run: 0.00005286 seconds
Input size:    1000000 | Avg time with first run: 0.00021406 seconds | Avg time without first run: 0.00020699 seconds
Input size:   10000000 | Avg time with first run: 0.00105273 seconds | Avg time without first run: 0.00105112 seconds
Input size:  100000000 | Avg time with first run: 0.01051427 seconds | Avg time without first run: 0.01051234 seconds

I feel we are pushing to the limit where Python overhead could be something to worry about.

We are absolutely there already - this PR is trying to minimize the number of Python operations we're doing in the __call__ method.

@shwina
Copy link
Contributor Author

shwina commented Feb 6, 2025

/ok to test

Copy link
Contributor

github-actions bot commented Feb 6, 2025

🟥 CI finished in 6m 06s: Pass: 0%/1 | Total: 6m 06s | Avg: 6m 06s | Max: 6m 06s
  • 🟥 python: Pass: 0%/1 | Total: 6m 06s | Avg: 6m 06s | Max: 6m 06s

    🟥 cpu
      🟥 amd64              Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
    🟥 ctk
      🟥 12.8               Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
    🟥 cudacxx
      🟥 nvcc12.8           Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
    🟥 cudacxx_family
      🟥 nvcc               Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
    🟥 cxx
      🟥 GCC13              Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
    🟥 cxx_family
      🟥 GCC                Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
    🟥 gpu
      🟥 rtx2080            Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
    🟥 jobs
      🟥 Test               Pass:   0%/1   | Total:  6m 06s | Avg:  6m 06s | Max:  6m 06s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@shwina
Copy link
Contributor Author

shwina commented Feb 6, 2025

/ok to test

Copy link
Contributor

github-actions bot commented Feb 6, 2025

🟥 CI finished in 6m 05s: Pass: 0%/1 | Total: 6m 05s | Avg: 6m 05s | Max: 6m 05s
  • 🟥 python: Pass: 0%/1 | Total: 6m 05s | Avg: 6m 05s | Max: 6m 05s

    🟥 cpu
      🟥 amd64              Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
    🟥 ctk
      🟥 12.8               Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
    🟥 cudacxx
      🟥 nvcc12.8           Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
    🟥 cudacxx_family
      🟥 nvcc               Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
    🟥 cxx
      🟥 GCC13              Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
    🟥 cxx_family
      🟥 GCC                Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
    🟥 gpu
      🟥 rtx2080            Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
    🟥 jobs
      🟥 Test               Pass:   0%/1   | Total:  6m 05s | Avg:  6m 05s | Max:  6m 05s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@shwina
Copy link
Contributor Author

shwina commented Feb 6, 2025

/ok to test

Copy link
Contributor

github-actions bot commented Feb 6, 2025

🟩 CI finished in 33m 23s: Pass: 100%/1 | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
  • 🟩 python: Pass: 100%/1 | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 33m 23s | Avg: 33m 23s | Max: 33m 23s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@shwina shwina force-pushed the cuda-parallel-minor-perf-improvements branch from 0f404e7 to 41f652d Compare February 7, 2025 16:38
@shwina
Copy link
Contributor Author

shwina commented Feb 7, 2025

/ok to test

Copy link
Contributor

github-actions bot commented Feb 7, 2025

🟩 CI finished in 28m 40s: Pass: 100%/1 | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
  • 🟩 python: Pass: 100%/1 | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 28m 40s | Avg: 28m 40s | Max: 28m 40s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@shwina shwina force-pushed the cuda-parallel-minor-perf-improvements branch from 8bac88d to 2d2af2c Compare February 10, 2025 12:16
@shwina shwina force-pushed the cuda-parallel-minor-perf-improvements branch from ed71555 to 5d946c1 Compare February 10, 2025 12:41
@shwina shwina force-pushed the cuda-parallel-minor-perf-improvements branch from 5d946c1 to 0429181 Compare February 10, 2025 12:43
@shwina shwina marked this pull request as ready for review February 10, 2025 14:15
@shwina shwina requested a review from a team as a code owner February 10, 2025 14:15
@shwina shwina requested a review from alliepiper February 10, 2025 14:15
@leofang
Copy link
Member

leofang commented Feb 10, 2025

  1. Removing type validation between the calls to Reduce.__init__ and Reduce.__call__: while this removes several guardrails, I think it's appropriate. Higher level APIs can hide the Reduce object from the user altogether and ensure that there is no way to pass objects of different dtype between the calls to __init__ and __call__.

In the near future we should consider establishing an API contract for plan building and plan execution (#2429 (comment)).

we could consider changing the API to not accept __cuda_array_interface__ objects, and instead have the user pass in the required information (pointer, size, dtype, etc.,). This allows each library/user to compute that information in the most efficient way possible rather than making it our responsibility.

Let's have a separate issue to track this. Thinking about this more we should try to make the current (low-level) interface look more like a 1:1 binding to the bare C++ one. This is what we do for cuda.cooperative too. Pythonic interface can come later.

Copy link
Contributor

🟥 CI finished in 5m 44s: Pass: 0%/1 | Total: 5m 44s | Avg: 5m 44s | Max: 5m 44s
  • 🟥 python: Pass: 0%/1 | Total: 5m 44s | Avg: 5m 44s | Max: 5m 44s

    🟥 cpu
      🟥 amd64              Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
    🟥 ctk
      🟥 12.8               Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
    🟥 cudacxx
      🟥 nvcc12.8           Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
    🟥 cudacxx_family
      🟥 nvcc               Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
    🟥 cxx
      🟥 GCC13              Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
    🟥 cxx_family
      🟥 GCC                Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
    🟥 gpu
      🟥 rtx2080            Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
    🟥 jobs
      🟥 Test               Pass:   0%/1   | Total:  5m 44s | Avg:  5m 44s | Max:  5m 44s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@shwina shwina force-pushed the cuda-parallel-minor-perf-improvements branch from 645d11c to 0429181 Compare February 10, 2025 16:13
Copy link
Contributor

🟩 CI finished in 29m 45s: Pass: 100%/1 | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
  • 🟩 python: Pass: 100%/1 | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 29m 45s | Avg: 29m 45s | Max: 29m 45s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@oleksandr-pavlyk
Copy link
Contributor

In the function is_contiguous we do

def is_contiguous(arr: DeviceArrayLike) -> bool:
    shape, strides = get_shape(arr), get_strides(arr)

    if strides is None:
        return True

    if any(dim == 0 for dim in shape):
        # array has no elements
        return True
   [---SNIPPED--]

but we do not use shape if strides is None. So I propose to speed-up that case by changing the function to:

def is_contiguous(arr: DeviceArrayLike) -> bool:
    strides = get_strides(arr)

    if strides is None:
        return True

    shape = get_shape(arr)
    if any(dim == 0 for dim in shape):
        # array has no elements
        return True
   [---SNIPPED--]

Comment on lines 32 to 36
typestr = arr.__cuda_array_interface__["typestr"]

if typestr.startswith("|V"):
# it's a structured dtype, use the descr field:
return np.dtype(arr.__cuda_array_interface__["descr"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In certain cases we might need to access arr.__cuda_array_interface__ more than once. Given that it is expensive, why not save it to a variable and reuse the interface dictionary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion. I tried to reuse the CAI as much as possible as part of b8474a4.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above discussion: #3718 (comment)

@@ -45,9 +45,9 @@ def op(a, b):
num_items = 2**num_items_pow2
h_input = random_int(num_items, dtype)
d_input = numba.cuda.to_device(h_input)
temp_storage_size = reduce_into(None, d_input, d_output, None, h_init)
temp_storage_size = reduce_into(None, d_input, d_output, len(d_input), h_init)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not have intuition for how expensive the call to len(d_input) is, but I would save it and reused the result few lines below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really quite fast, and in the context of these unit tests, storing the result in a variable wouldn't really save us much. In this case, I don't think it's necessary:

In [4]: arr = cp.asarray([1, 2, 3])

In [5]: %timeit arr.size
16.7 ns ± 0.0128 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

In [6]: %timeit len(arr)
21.2 ns ± 0.866 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

@@ -63,40 +63,15 @@ def op(a, b):
for num_items in [42, 420000]:
h_input = np.random.random(num_items) + 1j * np.random.random(num_items)
d_input = numba.cuda.to_device(h_input)
temp_storage_bytes = reduce_into(None, d_input, d_output, None, h_init)
temp_storage_bytes = reduce_into(None, d_input, d_output, len(d_input), h_init)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment. Consider saving and reusing the result of len(d_input).

BTW, is d_input necessarily viewed as 1d array? In NumPy, len(arr) gives arr.shape[0], but here we likely need arr.size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone ahead and replaced len(arr) with arr.size, here and in other places, as part of b8474a4.

@@ -22,13 +22,13 @@ def min_op(a, b):
reduce_into = algorithms.reduce_into(d_output, d_output, min_op, h_init)

# Determine temporary device storage requirements
temp_storage_size = reduce_into(None, d_input, d_output, None, h_init)
temp_storage_size = reduce_into(None, d_input, d_output, len(d_input), h_init)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider saving and reusing

Copy link
Contributor

🟩 CI finished in 34m 26s: Pass: 100%/1 | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
  • 🟩 python: Pass: 100%/1 | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 34m 26s | Avg: 34m 26s | Max: 34m 26s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

Copy link
Contributor

@oleksandr-pavlyk oleksandr-pavlyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @shwina

@leofang leofang merged commit a03ce7b into NVIDIA:main Feb 11, 2025
20 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants