cuda.parallel: Exclude allocation times from pytest-benchmarks + add struct benchmarks #4418
Conversation
🟩 CI finished in 1h 27m: Pass: 100%/1 | Total: 1h 27m | Avg: 1h 27m | Max: 1h 27m
🏃 Runner counts (total jobs: 1): 1 × linux-amd64-gpu-rtx2080-latest-1
Looks good; noticed some inconsistencies that should probably be addressed.
```diff
-def merge_sort_iterator(size, build_only):
+def merge_sort_iterator(size, output_keys, output_vals, build_only):
     keys_dt = cp.int32
     vals_dt = cp.int64
     keys = iterators.CountingIterator(np.int32(0))
     vals = iterators.CountingIterator(np.int64(0))
-    res_keys = cp.empty(size, dtype=keys_dt)
-    res_vals = cp.empty(size, dtype=vals_dt)
+    output_keys = cp.empty(size, dtype=keys_dt)
+    output_vals = cp.empty(size, dtype=vals_dt)

     def my_cmp(a: np.int32, b: np.int32) -> np.int32:
         return np.int32(a < b)

-    alg = algorithms.merge_sort(keys, vals, res_keys, res_vals, my_cmp)
-    temp_bytes = alg(None, keys, vals, res_keys, res_vals, size)
+    alg = algorithms.merge_sort(keys, vals, output_keys, output_vals, my_cmp)
+    temp_bytes = alg(None, keys, vals, output_keys, output_vals, size)
```
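The refactor above follows a common pytest-benchmark pattern: output buffers are allocated by the caller and passed into the benchmarked callable, so the timed region covers only the algorithm itself. A minimal, CPU-only sketch of that pattern (the `bench` helper and `merge_sort_pairs` function are hypothetical stand-ins, not cuda.parallel APIs):

```python
import time

def bench(fn, *args):
    """Minimal stand-in for pytest-benchmark's `benchmark` fixture:
    it times only the call to `fn`, so any allocation done before
    this point is excluded from the measurement."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

def merge_sort_pairs(keys, vals, output_keys, output_vals):
    # Output buffers are supplied by the caller, mirroring the
    # signature change in the diff above.
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    for dst, src in enumerate(order):
        output_keys[dst] = keys[src]
        output_vals[dst] = vals[src]

keys = [3, 1, 2]
vals = [30, 10, 20]
out_keys = [0] * len(keys)   # allocated outside the timed region
out_vals = [0] * len(vals)   # allocated outside the timed region

_, elapsed = bench(merge_sort_pairs, keys, vals, out_keys, out_vals)
```

With the real pytest-benchmark fixture the same idea applies: anything created before `benchmark(...)` is invoked stays out of the reported timing.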
Wouldn't we also want to exclude iterator creation times here?
Changed so that we pass the iterator into the benchmarked function
```python
temp_bytes = alg(None, keys, vals, output_keys, output_vals, size)
scratch = cp.empty(temp_bytes, dtype=cp.uint8)
```
This is inconsistent with the above benchmarks, where you do not include the time to create temporary storage when running the algorithm.
Fixed (ditto below)
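The fix keeps the two-phase calling convention consistent across benchmarks: a first call with `None` temp storage only queries the scratch size, the scratch buffer is allocated outside the timed region, and only the compute call is benchmarked. A runnable, GPU-free sketch of that convention (`fake_reduce` is a hypothetical stand-in for a cuda.parallel algorithm object, and the 64-byte scratch size is invented):

```python
def fake_reduce(temp_storage, data, init):
    """Hypothetical stand-in for a cuda.parallel algorithm object.

    Passing temp_storage=None performs the size query; passing a
    real buffer performs the actual computation.
    """
    if temp_storage is None:
        return 64  # pretend the algorithm needs 64 bytes of scratch
    return sum(data, init)

data = list(range(10))
temp_bytes = fake_reduce(None, data, 0)  # size query, untimed
scratch = bytearray(temp_bytes)          # allocation, untimed
result = fake_reduce(scratch, data, 0)   # only this call is timed
```

Moving both the size query and the scratch allocation outside the benchmarked callable is what keeps these benchmarks consistent with the earlier ones.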
```python
temp_bytes = alg(None, input_array, res, size, h_init)
scratch = cp.empty(temp_bytes, dtype=cp.uint8)
```
Same as above.
```python
temp_bytes = alg(None, d, res, size, h_init)
scratch = cp.empty(temp_bytes, dtype=cp.uint8)
```
Same as above.
pre-commit.ci autofix
/ok to test
@shwina, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
🟩 CI finished in 1h 32m: Pass: 100%/1 | Total: 1h 32m | Avg: 1h 32m | Max: 1h 32m
🏃 Runner counts (total jobs: 1): 1 × linux-amd64-gpu-rtx2080-latest-1
Force-pushed from 77049ea to d33dfba.
🟩 CI finished in 17m 08s: Pass: 100%/3 | Total: 26m 07s | Avg: 8m 42s | Max: 17m 08s
🏃 Runner counts (total jobs: 3): 3 × linux-amd64-gpu-rtx2080-latest-1
Description
This PR modifies the cuda.parallel (Python) benchmarks so that allocations for input and output arrays are excluded from the benchmark timings.
Additionally, it adds benchmarks for struct inputs.
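For the struct benchmarks, the main difference from the scalar benchmarks is the element type. A hedged, CPU-only sketch of what a struct-valued reduction might look like (`MinMax` and `minmax_op` are illustrative stand-ins, not the structs actually added by this PR; cuda.parallel defines device structs differently, e.g. via its own struct decorator):

```python
from typing import NamedTuple

class MinMax(NamedTuple):
    """Illustrative two-field struct; a plain NamedTuple stands in so
    the benchmark shape is runnable without a GPU."""
    min_val: int
    max_val: int

def minmax_op(a: MinMax, b: MinMax) -> MinMax:
    # Combine two partial results; this is the binary operator the
    # GPU reduction benchmark would time.
    return MinMax(min(a.min_val, b.min_val), max(a.max_val, b.max_val))

data = [MinMax(v, v) for v in [5, 2, 9, 1]]
acc = MinMax(2**31 - 1, -(2**31))  # identity for min/max over int32
for item in data:
    acc = minmax_op(acc, item)
```

As with the scalar benchmarks, the struct input and output arrays would be allocated before the timed region so only the reduction itself is measured.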
Checklist