
cuda.parallel: Exclude allocation times from pytest-benchmarks + add struct benchmarks #4418


Merged: 8 commits merged into NVIDIA:main on Apr 15, 2025

Conversation

shwina (Contributor) commented Apr 11, 2025

Description

This PR makes a modification to the cuda.parallel (Python) benchmarks so that allocations (for input and output arrays) are not included in the benchmark timings.

Additionally, benchmarks for struct inputs are added.
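The pattern this PR applies can be illustrated with a minimal stand-alone sketch (pure Python with hypothetical names; the real benchmarks use pytest-benchmark and CuPy device allocations): input and output buffers are created during setup, and only the algorithm call itself falls inside the timed region.

```python
import time

def make_buffers(size):
    # Stand-in for the cp.empty(...) input/output allocations; this runs
    # during setup and is excluded from the measured time.
    return list(range(size)), [0] * size

def time_only(fn, *args):
    # Times just the call, mirroring how the pytest-benchmark timed region
    # should contain only the algorithm launch.
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def run_benchmark(size=10_000):
    data, out = make_buffers(size)        # setup: not timed

    def algorithm(data, out):             # only this part is timed
        for i, x in enumerate(data):
            out[i] = x * 2

    elapsed = time_only(algorithm, data, out)
    return elapsed, out

elapsed, out = run_benchmark()
```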

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina shwina requested review from a team as code owners April 11, 2025 18:37
@shwina shwina requested a review from NaderAlAwar April 11, 2025 18:37
@github-project-automation github-project-automation bot moved this to Todo in CCCL Apr 11, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Apr 11, 2025

🟩 CI finished in 1h 27m: Pass: 100%/1 | Total: 1h 27m | Avg: 1h 27m | Max: 1h 27m
  • 🟩 python: Pass: 100%/1 | Total: 1h 27m | Avg: 1h 27m | Max: 1h 27m

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@shwina shwina changed the title cuda.parallel: Exclude allocation times from pytest-benchmarks cuda.parallel: Exclude allocation times from pytest-benchmarks + add struct benchmarks Apr 13, 2025
NaderAlAwar (Contributor) left a comment

Looks good; I noticed some inconsistency that should probably be addressed.

Comment on lines 24 to 37

    -def merge_sort_iterator(size, build_only):
    +def merge_sort_iterator(size, output_keys, output_vals, build_only):
         keys_dt = cp.int32
         vals_dt = cp.int64
         keys = iterators.CountingIterator(np.int32(0))
         vals = iterators.CountingIterator(np.int64(0))
    -    res_keys = cp.empty(size, dtype=keys_dt)
    -    res_vals = cp.empty(size, dtype=vals_dt)
    +    output_keys = cp.empty(size, dtype=keys_dt)
    +    output_vals = cp.empty(size, dtype=vals_dt)

         def my_cmp(a: np.int32, b: np.int32) -> np.int32:
             return np.int32(a < b)

    -    alg = algorithms.merge_sort(keys, vals, res_keys, res_vals, my_cmp)
    -    temp_bytes = alg(None, keys, vals, res_keys, res_vals, size)
    +    alg = algorithms.merge_sort(keys, vals, output_keys, output_vals, my_cmp)
    +    temp_bytes = alg(None, keys, vals, output_keys, output_vals, size)
Contributor:

Wouldn't we also want to exclude iterator creation times here?

Contributor Author:

Changed so that we pass the iterator into the benchmarked function
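A hedged sketch of that change (plain Python generators standing in for cuda.parallel iterators; all names hypothetical): the iterators are constructed once during setup and handed to the benchmarked callable, so their creation cost stays outside the timed region.

```python
import itertools

def setup_iterators():
    # Stand-in for iterators.CountingIterator(...); built once, untimed.
    return itertools.count(0), itertools.count(0)

def benchmarked(keys, vals, n):
    # Only consuming the pre-built iterators happens in the timed region.
    return [k + v for k, v in itertools.islice(zip(keys, vals), n)]

keys, vals = setup_iterators()   # setup: excluded from the timing
result = benchmarked(keys, vals, 3)
```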

Comment on lines +61 to +55
temp_bytes = alg(None, keys, vals, output_keys, output_vals, size)
scratch = cp.empty(temp_bytes, dtype=cp.uint8)
Contributor:

This is inconsistent with the above benchmarks, where you do not include the time to create temporary storage when running the algorithm.

Contributor Author:

Fixed (ditto below)

Comment on lines +38 to +39
temp_bytes = alg(None, input_array, res, size, h_init)
scratch = cp.empty(temp_bytes, dtype=cp.uint8)
Contributor:

same as above

temp_bytes = alg(None, d, res, size, h_init)
scratch = cp.empty(temp_bytes, dtype=cp.uint8)
Contributor:

same as above
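The snippets under discussion follow a CUB-style two-phase protocol: calling the algorithm with a null temp buffer only returns the required scratch size, so the scratch allocation can happen once during setup and stay out of the timed region. A minimal stand-in (pure Python, hypothetical names, no GPU required):

```python
def alg(temp, data, out):
    # Phase 1: with temp=None, report required scratch bytes; do no work.
    needed = 64
    if temp is None:
        return needed
    # Phase 2: with scratch provided, perform the actual work.
    assert len(temp) >= needed
    out[:] = sorted(data)
    return needed

data = [3, 1, 2]
out = [0, 0, 0]
temp_bytes = alg(None, data, out)      # size query: setup, untimed
scratch = bytearray(temp_bytes)        # allocated once, outside the timing
alg(scratch, data, out)                # only this call belongs in the timing
```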

shwina (Contributor Author) commented Apr 15, 2025

pre-commit.ci autofix


copy-pr-bot bot commented Apr 15, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

shwina (Contributor Author) commented Apr 15, 2025

/ok to test


copy-pr-bot bot commented Apr 15, 2025

/ok to test

@shwina, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/


🟩 CI finished in 1h 32m: Pass: 100%/1 | Total: 1h 32m | Avg: 1h 32m | Max: 1h 32m
  • 🟩 python: Pass: 100%/1 | Total: 1h 32m | Avg: 1h 32m | Max: 1h 32m

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@shwina shwina force-pushed the python-benchmarks-reuse-allocation branch from 77049ea to d33dfba Compare April 15, 2025 16:12

🟩 CI finished in 17m 08s: Pass: 100%/3 | Total: 26m 07s | Avg: 8m 42s | Max: 17m 08s
  • 🟩 python: Pass: 100%/3 | Total: 26m 07s | Avg: 8m 42s | Max: 17m 08s

    🟩 cpu
      🟩 amd64              Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 ctk
      🟩 12.8               Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 cxx
      🟩 GCC13              Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 jobs
      🟩 cuda.cccl          Pass: 100%/1   | Total:  2m 47s | Avg:  2m 47s | Max:  2m 47s
      🟩 cuda.cooperative   Pass: 100%/1   | Total: 17m 08s | Avg: 17m 08s | Max: 17m 08s
      🟩 cuda.parallel      Pass: 100%/1   | Total:  6m 12s | Avg:  6m 12s | Max:  6m 12s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 3)

# Runner
3 linux-amd64-gpu-rtx2080-latest-1

@shwina shwina merged commit 8d1521a into NVIDIA:main Apr 15, 2025
19 of 20 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Apr 15, 2025