StreamHPC 2023-11-17 (batch memcpy) (#485)
* Implemented batch memcpy algorithm and relevant tests and benchmarks

* Optimize match_any by using arithmetic shifts

  The compiler seems to see through these much better than the conditional, generating bit-field extract instructions and recognizing that the loop is a reduction.

* Pedantic / consistency changes for batch memcpy

* Improve interface and implementation of align_(up|down)

  - Use the alignment of the destination type instead of its size
  - Rename to emphasize that this does a form of reinterpret_cast
  - Use the same type as the return type and template parameter, to match the interface of built-in casts
  - Pedantic: use uintptr_t instead of size_t for the numerical value of a pointer
  - Use Clang's __builtin_align_(up|down) when available

* Take parameters as explicit const-ref in test_utils::bit_equal

  Because these are templates, this already works for non-copyable types (as `T` will be deduced to `Type&`), but it's confusing and wouldn't work for r-values. Because we are comparing object representations, taking a copy isn't okay, as that only guarantees that the value representation is copied (i.e. padding bytes are not required to be copied when taking a parameter by copy).

* Actually make custom_non(copyable|moveable)_type non (copy|move)-able

* Allow passing rocprim::default_config to batch_memcpy

  As all the other device functions do too.

* Fix typo in cast_align_down documentation

* Fixup accidentally deleted constructor of custom_non_moveable_type

  This was accidentally deleted; it was meant to be defaulted. Currently no test calls this, as batch-memcpy tests only use this type at the device side.

* Improve error message of test_rocprim_package

  The error message of the package test wasn't very nice; improve it for easier debugging in the future.

  Before:
  ```console
  ❯ ./a.out
  98
  ```
  After:
  ```console
  ❯ ./a.out
  Error hipErrorInvalidDeviceFunction(98): invalid device function in main at test_rocprim_package.cpp:90
  ```

* Refactor test_utils::get_random_data into generate_random_data_n

  - Writes the output into an output iterator instead of creating & returning a vector. This allows greater flexibility for users, e.g. writing random values with differing options into the same container.
  - Accepts a generator instead of a seed. This is more efficient, because creating an instance of an rng engine might be costly. It's also more consistent with how the standard library operates.
  - The naming and interface try to mirror the STL (i.e. `std::generate_n`)
  - Backwards compatibility is maintained by adding test_utils::get_random_data that uses `generate_random_data_n` internally.

* Refactor get_random_data into generate_random_data_n in benchmark_utils

  This mirrors the test changes in the previous commit.

* Unify segmented generation from test generate_random_data_n overloads

* Add missing include for iterator traits to benchmark_utils

* ci: use build instead of rocm-build tag

  This allows the build job to be performed by any runner configured for building, instead of the ROCm-specialized builder. As the target architectures are specified ahead of time, the GPU is not needed during the build process, so the build may be performed by any builder.

* fix: Fixed doxygen warning in device_memcpy_config.hpp

* Speed up / Improve data-generation in test_device_batch_memcpy

  Do bulk data-generation instead of individual calls, especially of individual bytes for the data to copy. Also changes the verification to do bulk memcmp instead of item-wise test_utils::bit_equals for each buffer. Overall this reduces the time it takes to run the test from around 3s to ~1s.

* Refactor & Speed up benchmark_device_batch_memcpy

  - Share the data generation between the naive and uut benchmarks
  - Make the data-generation be bulk, using a fast random number engine (mt19937) to significantly speed it up. The overall runtime of the benchmark decreased from 14 minutes (!) to around 2 minutes.

* Fix explanation comment in batch_memcpy test/benchmark

* fix include order in benchmark_device_batch_memcpy

* doc: add batch memcpy to changelog

---------

Co-authored-by: Gergely Meszaros <[email protected]>
Co-authored-by: Robin Voetter <[email protected]>
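For readers unfamiliar with the align_(up|down) change, the following is a minimal sketch of what a cast-style alignment helper along those lines could look like. It is not the rocPRIM implementation: the signatures are hypothetical, it takes a plain `void*`, it assumes power-of-two alignments and a flat address space, and it uses only portable integer arithmetic rather than Clang's `__builtin_align_(up|down)`.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical sketch, NOT the rocPRIM code. The template parameter is the
// full destination pointer type, so a call reads like a built-in cast:
//   cast_align_down<std::uint32_t*>(p)

// Round p down to the alignment of the destination pointee and cast.
template<class Ptr>
Ptr cast_align_down(void* p)
{
    static_assert(std::is_pointer<Ptr>::value, "Ptr must be a pointer type");
    // Use the alignment of the destination type, not its size.
    constexpr std::uintptr_t alignment
        = alignof(typename std::remove_pointer<Ptr>::type);
    // uintptr_t (not size_t) holds the numerical value of the pointer.
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    // Assumes alignment is a power of two, which alignof guarantees.
    return reinterpret_cast<Ptr>(addr & ~(alignment - 1));
}

// Round p up to the alignment of the destination pointee and cast.
template<class Ptr>
Ptr cast_align_up(void* p)
{
    static_assert(std::is_pointer<Ptr>::value, "Ptr must be a pointer type");
    constexpr std::uintptr_t alignment
        = alignof(typename std::remove_pointer<Ptr>::type);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<Ptr>((addr + alignment - 1) & ~(alignment - 1));
}
```

With a 16-byte-aligned byte buffer `buf`, `cast_align_down<std::uint32_t*>(buf + 3)` would point at `buf` and `cast_align_up<std::uint32_t*>(buf + 3)` at `buf + 4` on platforms where `alignof(std::uint32_t)` is 4. An already aligned pointer is returned unchanged by both helpers.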
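The generate_random_data_n refactor described above (output iterator instead of a returned vector, generator instead of a seed, seed-based wrapper kept for compatibility) can be illustrated roughly as follows. This is only a sketch under assumptions: the parameter lists are hypothetical and do not claim to match the test_utils signatures, and it handles only integer types accepted by `std::uniform_int_distribution`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <type_traits>
#include <vector>

// Hypothetical sketch, NOT the rocPRIM test_utils code.
// Writes n random values in [min, max] through an output iterator and
// returns the iterator past the last written element, like std::generate_n.
template<class OutputIt, class T, class Generator>
OutputIt generate_random_data_n(OutputIt out, std::size_t n, T min, T max, Generator& gen)
{
    static_assert(std::is_integral<T>::value, "sketch handles integer types only");
    // Note: uniform_int_distribution is not specified for char-sized types.
    std::uniform_int_distribution<T> dist(min, max);
    // The generator is taken by reference, so its state advances across
    // calls and the caller pays for engine construction only once.
    return std::generate_n(out, n, [&] { return dist(gen); });
}

// Seed-based wrapper in the spirit of the backwards-compatible
// get_random_data: builds an engine and delegates to the function above.
template<class T>
std::vector<T> get_random_data(std::size_t size, T min, T max, unsigned int seed)
{
    std::mt19937 gen(seed);
    std::vector<T> data(size);
    generate_random_data_n(data.begin(), size, min, max, gen);
    return data;
}
```

Because the function returns the advanced iterator, two calls with different ranges can fill consecutive parts of one container from a single engine, which is the flexibility the commit message mentions. The wrapper remains deterministic for a fixed seed.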