StreamHPC 2023-11-17 (batch memcpy) (#485)
* Implemented batch memcpy algorithm and relevant tests and benchmarks

* Optimize match_any by using arithmetic shifts

The compiler seems to see through these much better than the conditional,
generating bit-field extract instructions, and recognizing that the loop
is a reduction.
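
The idea can be sketched on the host. The code below is an illustration under assumptions, not rocPRIM's device code: `simulated_ballot` stands in for a warp-wide ballot, the lane count is capped at 64, and all function names are hypothetical. Each lane's peer mask is built one label bit at a time, and a sign-extending shift selects either the ballot or its complement instead of a conditional.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a warp-wide ballot: bit `lane` of the result is
// set when bit `bit` of that lane's label is set.
inline std::uint64_t simulated_ballot(const std::vector<std::uint32_t>& labels, unsigned bit)
{
    std::uint64_t mask = 0;
    for(std::size_t lane = 0; lane < labels.size(); ++lane)
        mask |= static_cast<std::uint64_t>((labels[lane] >> bit) & 1u) << lane;
    return mask;
}

// Returns, for every lane, the mask of lanes that hold the same label.
// Only the low `label_bits` bits of each label are compared.
inline std::vector<std::uint64_t> match_any_sketch(const std::vector<std::uint32_t>& labels,
                                                   unsigned label_bits)
{
    const std::size_t lanes = labels.size();
    assert(lanes <= 64 && label_bits >= 1 && label_bits <= 32);
    const std::uint64_t active = (lanes == 64) ? ~0ull : ((1ull << lanes) - 1);
    std::vector<std::uint64_t> peers(lanes, active);
    for(unsigned bit = 0; bit < label_bits; ++bit)
    {
        const std::uint64_t ballot = simulated_ballot(labels, bit);
        for(std::size_t lane = 0; lane < lanes; ++lane)
        {
            // Move the bit into the sign position and shift it back
            // arithmetically: -1 (all ones) if the bit is set, 0 otherwise.
            // (Guaranteed since C++20; the common behaviour before that.)
            const std::int32_t spread
                = static_cast<std::int32_t>(labels[lane] << (31 - bit)) >> 31;
            const std::uint64_t all_ones_if_set
                = static_cast<std::uint64_t>(static_cast<std::int64_t>(spread));
            // Bit set: keep `ballot`. Bit clear: keep `~ballot`. No branch.
            peers[lane] &= ballot ^ ~all_ones_if_set;
        }
    }
    return peers;
}
```

For the labels `{1, 2, 1, 3}` compared on two bits, lanes 0 and 2 end up sharing one mask, while lanes 1 and 3 each match only themselves.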

* Pedantic / consistency changes for batch memcpy

* Improve interface and implementation of align_(up|down)

- Use the alignment of the destination type instead of its size
- Rename to emphasize that this does a form of reinterpret_cast
- Use the same type as the return type and template parameter, to
  match the interface of built-in casts
- Pedantic: use uintptr_t instead of size_t for the numerical value
  of a pointer
- Use clang's __builtin_align_(up|down) when available
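
A minimal sketch of what such cast-style helpers could look like, assuming a portable fallback path only. The names `cast_align_down_sketch`, `cast_align_up_sketch` and `aligned8` are hypothetical and not the names used in rocPRIM; the clang builtin path mentioned above is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical 8-byte-aligned type, used only for illustration.
struct alignas(8) aligned8
{
    unsigned char bytes[8];
};

// Rounds `pointer` down to the alignment of the pointee of `Ptr`.
// `Ptr` is both the template argument and the return type, like built-in casts.
template<class Ptr>
Ptr cast_align_down_sketch(void* pointer)
{
    static_assert(std::is_pointer<Ptr>::value, "Ptr must be a pointer type");
    assert(pointer != nullptr); // the sketch does not handle null pointers
    // Alignment (not size) of the destination type; always a power of two.
    constexpr std::uintptr_t alignment = alignof(typename std::remove_pointer<Ptr>::type);
    const std::uintptr_t value = reinterpret_cast<std::uintptr_t>(pointer);
    return reinterpret_cast<Ptr>(value & ~(alignment - 1));
}

// Rounds `pointer` up to the alignment of the pointee of `Ptr`.
template<class Ptr>
Ptr cast_align_up_sketch(void* pointer)
{
    static_assert(std::is_pointer<Ptr>::value, "Ptr must be a pointer type");
    assert(pointer != nullptr); // the sketch does not handle null pointers
    constexpr std::uintptr_t alignment = alignof(typename std::remove_pointer<Ptr>::type);
    const std::uintptr_t value = reinterpret_cast<std::uintptr_t>(pointer);
    return reinterpret_cast<Ptr>((value + alignment - 1) & ~(alignment - 1));
}
```

Calling `cast_align_up_sketch<aligned8*>(p)` reads like `reinterpret_cast<aligned8*>(p)`, which is the interface similarity the list above refers to.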

* Take parameters as explicit const-ref in test_utils::bit_equal

Because these are templates, this already works for non-copyable types
(as `T` will be deduced to `Type&`), but it's confusing and wouldn't work
for r-values. Because we are comparing object representations, taking a copy
isn't okay, as that only guarantees that the value representation is copied
(i.e. padding bytes are not required to be copied when taking a parameter
by copy).
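
As a rough illustration, a bitwise comparison taking explicit const references might look like the following. This is a sketch under assumptions, not the `test_utils::bit_equal` implementation; `bit_equal_sketch`, `non_copyable_sketch` and `bit_equal_usage` are hypothetical names.

```cpp
#include <cassert>
#include <cstring>

// Compares the object representations (all bytes) of two objects.
// Taking const references avoids copies, so non-copyable types and
// r-values both work, and no bytes are lost to a by-value copy.
template<class T>
bool bit_equal_sketch(const T& lhs, const T& rhs)
{
    return std::memcmp(&lhs, &rhs, sizeof(T)) == 0;
}

// Hypothetical non-copyable type, used only for illustration.
struct non_copyable_sketch
{
    int value;
    explicit non_copyable_sketch(int v) : value(v) {}
    non_copyable_sketch(const non_copyable_sketch&)            = delete;
    non_copyable_sketch& operator=(const non_copyable_sketch&) = delete;
};

// Usage check: +0.0f and -0.0f are equal by value but differ bitwise
// (assuming IEEE 754 floats).
inline void bit_equal_usage()
{
    const float positive_zero = 0.0f;
    const float negative_zero = -0.0f;
    assert(positive_zero == negative_zero);
    assert(!bit_equal_sketch(positive_zero, negative_zero));
}
```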

* Actually make custom_non(copyable|moveable)_type non (copy|move)-able

* Allow passing rocprim::default_config to batch_memcpy

All the other device functions already accept it.

* Fix typo in cast_align_down documentation

* Fixup accidentally deleted constructor of custom_non_moveable_type

This was accidentally deleted; it was meant to be defaulted.
Currently no test calls this as batch-memcpy tests only use this type
at the device side.

* Improve error message of test_rocprim_package

The error message of the package test was not helpful; improve it
for easier debugging in the future.

Before:
```console
❯ ./a.out
98
```

After:
```console
❯ ./a.out
Error hipErrorInvalidDeviceFunction(98): invalid device function in main at test_rocprim_package.cpp:90
```
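
The "After" message can be assembled from an error name, a numeric code, a description and a source location. The sketch below only shows the formatting and uses a hypothetical helper name, `format_error_sketch`. In a HIP program the name and description would presumably come from `hipGetErrorName` and `hipGetErrorString`, and the location from `__func__`, `__FILE__` and `__LINE__`; those calls are left out so the example builds without ROCm.

```cpp
#include <cassert>
#include <string>

// Builds "Error <name>(<code>): <description> in <function> at <file>:<line>".
inline std::string format_error_sketch(const char* name,
                                       int         code,
                                       const char* description,
                                       const char* function,
                                       const char* file,
                                       int         line)
{
    assert(name && description && function && file);
    return std::string("Error ") + name + "(" + std::to_string(code) + "): " + description
           + " in " + function + " at " + file + ":" + std::to_string(line);
}
```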

* Refactor test_utils::get_random_data into generate_random_data_n

- Writes the output into an output iterator instead of creating &
  returning a vector. This gives users more flexibility,
  e.g. writing random values with differing options into the same
  container.
- Accepts a generator instead of a seed. This is more efficient, because
  creating an instance of an rng engine might be costly. It's also
  more consistent with how the standard library operates.
- The naming and interface try to mirror the STL (e.g. `std::generate_n`)
- Backwards compatibility is maintained by adding test_utils::get_random_data
  that uses `generate_random_data_n` internally.
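
The points above can be sketched as follows. This is a simplified illustration for integer types only, not the signature used in `test_utils`. The `_sketch` names and the parameter order are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Writes `count` uniformly distributed integers in [min, max] to `out`,
// drawing from a caller-provided generator, and returns the advanced
// iterator (like std::generate_n).
template<class OutputIt, class T, class Generator>
OutputIt generate_random_data_n_sketch(OutputIt out, std::size_t count, T min, T max, Generator& gen)
{
    assert(min <= max);
    std::uniform_int_distribution<T> distribution(min, max);
    for(std::size_t i = 0; i < count; ++i)
        *out++ = distribution(gen);
    return out;
}

// Backwards-compatible wrapper: seed-based interface that returns a vector.
template<class T>
std::vector<T> get_random_data_sketch(std::size_t count, T min, T max, unsigned int seed)
{
    std::vector<T> data(count);
    std::mt19937   gen(seed);
    generate_random_data_n_sketch(data.begin(), count, min, max, gen);
    return data;
}
```

Because the function returns the advanced iterator, two calls with different ranges can fill adjacent parts of one container while sharing a single engine.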

* Refactor get_random_data into generate_random_data_n in benchmark_utils

This mirrors the test changes in the previous commit

* Unify segmented generation from test generate_random_data_n overloads

* Add missing include for iterator traits to benchmark_utils

* ci: use build instead of rocm-build tag

This allows the build job to be performed by any runner configured
for building, instead of only the ROCm-specialized builder. As the
target architectures are specified ahead of time, no GPU is needed
during the build, so it may be performed by any builder.

* fix: Fixed doxygen warning in device_memcpy_config.hpp

* Speed up / Improve data-generation in test_device_batch_memcpy

Do bulk data-generation instead of individual calls, especially of
individual bytes for the data to copy.
Also changes the verification to do one bulk memcmp per buffer instead of
item-wise test_utils::bit_equal calls.
Overall this reduces the time it takes to run the test to ~1s from
around 3s.
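
A host-side sketch of the semantics being tested and of the bulk verification described above. It is an illustration under assumptions, not the test code: `reference_batch_memcpy` and `buffers_match` are hypothetical names, and the device algorithm itself is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host reference for a batched copy: buffer i copies sizes[i] bytes
// from sources[i] to destinations[i].
inline void reference_batch_memcpy(const std::vector<const void*>& sources,
                                   const std::vector<void*>&       destinations,
                                   const std::vector<std::size_t>& sizes)
{
    assert(sources.size() == sizes.size() && destinations.size() == sizes.size());
    for(std::size_t i = 0; i < sizes.size(); ++i)
        std::memcpy(destinations[i], sources[i], sizes[i]);
}

// Bulk verification: one memcmp per buffer rather than one comparison per item.
inline bool buffers_match(const std::vector<const void*>& sources,
                          const std::vector<void*>&       destinations,
                          const std::vector<std::size_t>& sizes)
{
    assert(sources.size() == sizes.size() && destinations.size() == sizes.size());
    for(std::size_t i = 0; i < sizes.size(); ++i)
    {
        if(std::memcmp(destinations[i], sources[i], sizes[i]) != 0)
            return false;
    }
    return true;
}
```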

* Refactor & Speedup benchmark_device_batch_memcpy

- Share the data generation between the naive and uut benchmarks
- Generate the data in bulk using a fast random number engine
  (mt19937), which speeds it up significantly.

The overall runtime of the benchmark decreased from 14 minutes (!) to
around 2 minutes.
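
One way bulk byte generation with `std::mt19937` could look is sketched below. This is an assumption about the general approach, not the benchmark's code, and `fill_random_bytes_sketch` is a hypothetical name. Each engine call yields 32 random bits, so four bytes are written per call instead of one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>

// Fills `size` bytes at `destination`, consuming one 32-bit engine output
// per four bytes (the final chunk may be shorter).
inline void fill_random_bytes_sketch(unsigned char* destination, std::size_t size, std::mt19937& gen)
{
    assert(destination != nullptr || size == 0);
    std::size_t offset = 0;
    while(offset < size)
    {
        const std::uint32_t word      = static_cast<std::uint32_t>(gen());
        const std::size_t   remaining = size - offset;
        const std::size_t   chunk     = remaining < sizeof(word) ? remaining : sizeof(word);
        std::memcpy(destination + offset, &word, chunk);
        offset += chunk;
    }
}
```

Sharing one generated buffer between the naive and the tested benchmark, as the list above describes, then only requires generating once and passing the same data to both.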

* Fix explanation comment in batch_memcpy test/benchmark

* fix include order in benchmark_device_batch_memcpy

* doc: add batch memcpy to changelog

---------

Co-authored-by: Gergely Meszaros <[email protected]>
Co-authored-by: Robin Voetter <[email protected]>
3 people authored Nov 21, 2023
1 parent f2347ab commit 8118351
Showing 20 changed files with 2,666 additions and 308 deletions.
14 changes: 7 additions & 7 deletions .gitlab-ci.yml
@@ -49,7 +49,7 @@ clang-format:
stage: lint
needs: []
tags:
-  - rocm-build
+  - build
variables:
CLANG_FORMAT: "/opt/rocm/llvm/bin/clang-format"
GIT_CLANG_FORMAT: "/opt/rocm/llvm/bin/git-clang-format"
@@ -66,7 +66,7 @@ copyright-date:
stage: lint
needs: []
tags:
-  - rocm-build
+  - build
rules:
- if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
script:
@@ -101,7 +101,7 @@ copyright-date:
.build:vcpkg-apt:
stage: build
tags:
-  - rocm-build
+  - build
extends:
- .gpus:rocm-gpus
- .rules:build
@@ -157,7 +157,7 @@ build:cmake-minimum-apt:
.build:common:
stage: build
tags:
-  - rocm-build
+  - build
extends:
- .gpus:rocm-gpus
- .rules:build
@@ -207,7 +207,7 @@ build:package:
stage: build
needs: []
tags:
-  - rocm-build
+  - build
extends:
- .cmake-minimum
- .gpus:rocm-gpus
@@ -232,7 +232,7 @@ build:benchmark:
stage: build
needs: []
tags:
-  - rocm-build
+  - build
extends:
- .cmake-minimum
- .gpus:rocm-gpus
@@ -264,7 +264,7 @@ autotune:build:
stage: autotune
needs: []
tags:
-  - rocm-build
+  - build
extends:
- .cmake-minimum
- .gpus:rocm-gpus
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -5,12 +5,14 @@ Full documentation for rocPRIM is available at [https://rocprim.readthedocs.io/e
## [Unreleased rocPRIM-3.0.0 for ROCm 6.1.0]
### Added
- Added new primitive: `block_run_length_decode`.
+- Added new primitive: `batch_memcpy`.
### Changed
- Removed deprecated functionality: `reduce_by_key_config`, `MatchAny`, `scan_config`, `scan_by_key_config` and `radix_sort_config`.
- Renamed `scan_config_v2` to `scan_config`, `scan_by_key_config_v2` to `scan_by_key_config`, `radix_sort_config_v2` to `radix_sort_config`, `reduce_by_key_config_v2` to `reduce_by_key_config`, `radix_sort_config_v2` to `radix_sort_config`.
- Removed support for custom config types for device algorithms.
- `host_warp_size()` was moved into `rocprim/device/config_types.hpp`, and now uses either a `device_id` or a `stream` parameter to query the proper device and a `device_id` out parameter. The return type is `hipError_t`.
- Added support for __int128_t in `device_radix_sort` and `block_radix_sort`.
+- Improved the performance of `match_any` and `block_histogram` that uses it
### Fixed
- Fixed build issues with `rmake.py` on Windows when using VS 2017 15.8 or later due to a breaking fix with extended aligned storage.

1 change: 1 addition & 0 deletions benchmark/CMakeLists.txt
@@ -128,6 +128,7 @@ add_rocprim_benchmark(benchmark_block_scan.cpp)
add_rocprim_benchmark(benchmark_block_sort.cpp)
add_rocprim_benchmark(benchmark_config_dispatch.cpp)
add_rocprim_benchmark(benchmark_device_adjacent_difference.cpp)
+add_rocprim_benchmark(benchmark_device_batch_memcpy.cpp)
add_rocprim_benchmark(benchmark_device_binary_search.cpp)
add_rocprim_benchmark(benchmark_device_histogram.cpp)
add_rocprim_benchmark(benchmark_device_merge.cpp)