
cuda.parallel: Exclude allocation times from pytest-benchmarks + add struct benchmarks #4418


Merged: 8 commits merged into NVIDIA:main on Apr 15, 2025

Conversation

shwina (Contributor) commented Apr 11, 2025

Description

This PR makes a modification to the cuda.parallel (Python) benchmarks so that allocations (for input and output arrays) are not included in the benchmark timings.

Additionally, benchmarks for struct inputs are added.
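The pattern this PR applies can be illustrated with a minimal stand-alone sketch (pure Python with hypothetical names; the real benchmarks use pytest-benchmark and CuPy device allocations): input and output buffers are created during setup, and only the algorithm call itself falls inside the timed region.

```python
import time

def make_buffers(size):
    # Stand-in for the cp.empty(...) input/output allocations; this runs
    # during setup and is excluded from the measured time.
    return list(range(size)), [0] * size

def time_only(fn, *args):
    # Times just the call, mirroring how the pytest-benchmark timed region
    # should contain only the algorithm launch.
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def run_benchmark(size=10_000):
    data, out = make_buffers(size)        # setup: not timed

    def algorithm(data, out):             # only this part is timed
        for i, x in enumerate(data):
            out[i] = x * 2

    elapsed = time_only(algorithm, data, out)
    return elapsed, out

elapsed, out = run_benchmark()
```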

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina shwina requested review from a team as code owners April 11, 2025 18:37
@shwina shwina requested a review from NaderAlAwar April 11, 2025 18:37
@github-project-automation github-project-automation bot moved this to Todo in CCCL Apr 11, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Apr 11, 2025

🟩 CI finished in 1h 27m: Pass: 100%/1 | Total: 1h 27m | Avg: 1h 27m | Max: 1h 27m
  • 🟩 python: Pass: 100%/1 | Total: 1h 27m | Avg: 1h 27m | Max: 1h 27m

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total:  1h 27m | Avg:  1h 27m | Max:  1h 27m
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@shwina shwina changed the title cuda.parallel: Exclude allocation times from pytest-benchmarks cuda.parallel: Exclude allocation times from pytest-benchmarks + add struct benchmarks Apr 13, 2025
NaderAlAwar (Contributor) left a comment

Looks good; I noticed some inconsistency that should probably be addressed.

Comment on lines 24 to 37

    -def merge_sort_iterator(size, build_only):
    +def merge_sort_iterator(size, output_keys, output_vals, build_only):
         keys_dt = cp.int32
         vals_dt = cp.int64
         keys = iterators.CountingIterator(np.int32(0))
         vals = iterators.CountingIterator(np.int64(0))
    -    res_keys = cp.empty(size, dtype=keys_dt)
    -    res_vals = cp.empty(size, dtype=vals_dt)
    +    output_keys = cp.empty(size, dtype=keys_dt)
    +    output_vals = cp.empty(size, dtype=vals_dt)

         def my_cmp(a: np.int32, b: np.int32) -> np.int32:
             return np.int32(a < b)

    -    alg = algorithms.merge_sort(keys, vals, res_keys, res_vals, my_cmp)
    -    temp_bytes = alg(None, keys, vals, res_keys, res_vals, size)
    +    alg = algorithms.merge_sort(keys, vals, output_keys, output_vals, my_cmp)
    +    temp_bytes = alg(None, keys, vals, output_keys, output_vals, size)
Contributor:

Wouldn't we also want to exclude iterator creation times here?

Contributor Author:

Changed so that we pass the iterator into the benchmarked function
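A hedged sketch of that change (plain Python generators standing in for cuda.parallel iterators; all names hypothetical): the iterators are constructed once during setup and handed to the benchmarked callable, so their creation cost stays outside the timed region.

```python
import itertools

def setup_iterators():
    # Stand-in for iterators.CountingIterator(...); built once, untimed.
    return itertools.count(0), itertools.count(0)

def benchmarked(keys, vals, n):
    # Only consuming the pre-built iterators happens in the timed region.
    return [k + v for k, v in itertools.islice(zip(keys, vals), n)]

keys, vals = setup_iterators()   # setup: excluded from the timing
result = benchmarked(keys, vals, 3)
```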

Comment on lines +61 to +55
temp_bytes = alg(None, keys, vals, output_keys, output_vals, size)
scratch = cp.empty(temp_bytes, dtype=cp.uint8)
Contributor:

This is inconsistent with the above benchmarks, where you do not include the time to create temporary storage when running the algorithm.

Contributor Author:

Fixed (ditto below)

Comment on lines +38 to +39
temp_bytes = alg(None, input_array, res, size, h_init)
scratch = cp.empty(temp_bytes, dtype=cp.uint8)
Contributor:

same as above

temp_bytes = alg(None, d, res, size, h_init)
scratch = cp.empty(temp_bytes, dtype=cp.uint8)
Contributor:

same as above
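The snippets under discussion follow a CUB-style two-phase protocol: calling the algorithm with a null temp buffer only returns the required scratch size, so the scratch allocation can happen once during setup and stay out of the timed region. A minimal stand-in (pure Python, hypothetical names, no GPU required):

```python
def alg(temp, data, out):
    # Phase 1: with temp=None, report required scratch bytes; do no work.
    needed = 64
    if temp is None:
        return needed
    # Phase 2: with scratch provided, perform the actual work.
    assert len(temp) >= needed
    out[:] = sorted(data)
    return needed

data = [3, 1, 2]
out = [0, 0, 0]
temp_bytes = alg(None, data, out)      # size query: setup, untimed
scratch = bytearray(temp_bytes)        # allocated once, outside the timing
alg(scratch, data, out)                # only this call belongs in the timing
```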

shwina (Contributor Author) commented Apr 15, 2025

pre-commit.ci autofix


copy-pr-bot bot commented Apr 15, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

shwina (Contributor Author) commented Apr 15, 2025

/ok to test


copy-pr-bot bot commented Apr 15, 2025

/ok to test

@shwina, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/


🟩 CI finished in 1h 32m: Pass: 100%/1 | Total: 1h 32m | Avg: 1h 32m | Max: 1h 32m
  • 🟩 python: Pass: 100%/1 | Total: 1h 32m | Avg: 1h 32m | Max: 1h 32m

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 ctk
      🟩 12.8               Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 gpu
      🟩 rtx2080            Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total:  1h 32m | Avg:  1h 32m | Max:  1h 32m
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 1)

# Runner
1 linux-amd64-gpu-rtx2080-latest-1

@shwina shwina force-pushed the python-benchmarks-reuse-allocation branch from 77049ea to d33dfba Compare April 15, 2025 16:12

🟩 CI finished in 17m 08s: Pass: 100%/3 | Total: 26m 07s | Avg: 8m 42s | Max: 17m 08s
  • 🟩 python: Pass: 100%/3 | Total: 26m 07s | Avg: 8m 42s | Max: 17m 08s

    🟩 cpu
      🟩 amd64              Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 ctk
      🟩 12.8               Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 cxx
      🟩 GCC13              Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/3   | Total: 26m 07s | Avg:  8m 42s | Max: 17m 08s
    🟩 jobs
      🟩 cuda.cccl          Pass: 100%/1   | Total:  2m 47s | Avg:  2m 47s | Max:  2m 47s
      🟩 cuda.cooperative   Pass: 100%/1   | Total: 17m 08s | Avg: 17m 08s | Max: 17m 08s
      🟩 cuda.parallel      Pass: 100%/1   | Total:  6m 12s | Avg:  6m 12s | Max:  6m 12s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 3)

# Runner
3 linux-amd64-gpu-rtx2080-latest-1

@shwina shwina merged commit 8d1521a into NVIDIA:main Apr 15, 2025
19 of 20 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Apr 15, 2025