Commit
* Excessive shared memory usage in block_shuffle fix
* remove block_sort_algorithm template param from block_sort_kernel_impl and block_sort_impl
* fixed compile errors
* Updated ChangeLog.md
* remove unnecessary code
* fixed CHANGELOG.md to not be so verbose about non public api changes
* Add dynamic dispatch and autotuning to device_adjacent_difference
* Fix device_adjacent_difference storage type
* ci: remove autotune dependency from build:benchmark
  The workaround needed to make this work has major disadvantages, and our current workflow does not make use of this dependency anyway (currently the generated configs are checked into the repository, so the CI would run the benchmarks on them on the next push to the merge request). When we improve automation around autotuning this could be implemented with conditional jobs, but let's just drop the dependency for now.
* test: fix indexing error test_type_helper<custom_16aligned>::get_random_data
  Indexing was 4 based when the type has 3 variables, therefore it was overflowing. Caught with address sanitizer.
* fixes for compilation in debug for radix_sort
  - Add force inline to onesweep kernel, to avoid too much shared memory errors
  - Declare `block_radix_sort::radix_bits_per_pass` to fix linker errors
* fix: Detect DPP & DPP broadcast support with __GFX<GENERATION>__ macros
  The amdgpu target in clang now provides the GFX generation as a predefined macro, so we no longer need to explicitly list all targets, which was bad for maintenance. Also replace the use of the generic `ROCPRIM_NAVI`, which signals navi support, with `ROCPRIM_DETAIL_HAS_DPP_BROADCAST`, a macro that explicitly states what we're after. This also makes sure that `ROCPRIM_DETAIL_USE_DPP` is always defined (to 0 when DPP is disabled); previously it was undefined when `ROCPRIM_DISABLE_DPP` was set.
* refactor: Use __GFX<GENERATION>__ to detect NAVI cards
* docs: Update CHANGELOG for DPP & ROCPRIM_NAVI fixes
* remove deprecated structs and functions
* rename scan_by_key_config_v2 to scan_by_key_config
  remove the option to use custom implemented config for scan_by_key
  update tests to not use custom implemented config for scan_by_key
* remove the option to use custom implemented config for histogram
  update tests to not use custom implemented config for histogram
* update config compile time check to a different pattern
* update documentation comments for configs
* change documentation comments
* change documentation comments on device_radix_sort
  rename radix_sort_config_v2 to radix_sort_config
* change documentation comment
  add static_assert to check type for reduce_config
* update documentation comments
  remove wrap_scan_config function
  add static_assert to disallow custom scan_config type
  rename scan_config_v2 to scan_config
* update documentation comments
* update documentation comments
  make transform_config inherit from detail::transform_config_params
  remove wrap_transform_config
  add static assert to test for Config type in device_transform
* remove wrap_adjacent_difference_config function
  add static_assert to test config type
  create default ctor for adjacent_difference_config
* add missing transform_config ctor
  rewrite adjacent_difference_config ctor to match other config structs
* fix binary search still using wrap_transform_config
* implement static_assert to make binary_search only use binary search configs, but also work with the underlying transform
* update changelog
* remove some *_v2s that went under the radar
* remove unnecessary default values
* Add binary search, lower_bound and upper_bound documentation
* host_warp_size() is replaced with two different versions with parameters.
  the new versions use either a device id or a stream to figure out the warp size of the device
* comment out unused param names
* fix typos in the documentation
* move host_warp_size to config_type.hpp
  changed host_warp_size signatures to fit other similar functions
* add error checks to host_warp_size calls in tests and benchmarks
* fix format
* add missing comment
* fix error handling in lookback_scan_state.hpp
* fix compilation error
* change block_radix_rank_match and block_histogram_atomic to use rocprim::match_any instead of implementing same functionality
* change radix_digit_count_helper to use rocprim::match_any instead of implementing same functionality
  added predicate param to rocprim::match_any to set invalid lanes and added tests for this functionality
* add elect function to warp intrinsics
  add test for elect
  change block_histogram_atomic, block_radix_rank_match, device_histogram, device_radix_sort to use elect instead of copy-paste code
* update match_any to return 0 when predicate is false
* fix the bit check in elect function
* update changelog.md
* fix hard coded warps per block value to come from param in kernel
* remove unused variables
* fix review comments
  minor name changes
  update test
  update comments
* update group_elect test
  tests multiple groups per warp
  doesn't check which exact thread is elected in a group, only that one is elected
* remove unnecessary comments
* remove expected from group_elect test
  fix compile error
* fix overindexing
* fix review comments
  update group_elect_test to have better coverage
* format
* fix review comments
* fix perf regression
* undo group_elect in block_histogram_atomic.hpp, because of perf impact
* fix bad func name in CHANGELOG.md
* fix merge errors
* Fix reduce_by_key algorithm so keys[0] is not flagged as a new run when it is NaN
* make device_radix_sort compatible with compiler provided __int128_t and __uint128_t
* add ifdefs to only compile int128 parts on clang/gcc
* update changelog
* fix for int128 to_string lambdas
* add test for block_radix_sort int128 support
* Implement block run length decode
* Fix reduce_by_key algorithm so out of bounds items are not flagged as new runs for NaNs
* Add reduce_by_key test to check that flagging is correct when keys are all different
* Fix performance regression observed during tuning for gfx1030 and gfx1102
* Block Runlength Decode: Fix incorrect offsets and improve test
* Remove duplicate key from .clang-format
* Remove additional duplicates from clang-format
* Fix binary_search upper/lower_bound config tuning
  Use specialized configurations for upper, lower, and binary search algorithms when performing tuning
* unify language around config params in documentation
* Make the autotune build job run nightly
* remove radix_sort_onesweep autotuning workaround
* Resolve doxygen warnings for upstream PR
* Enable get_device_from_stream for Windows
* Use _ENABLE_EXTENDED_ALIGNED_STORAGE for windows build in rmake.py
* Bump unreleased ROCm version

---------

Co-authored-by: Ivan Siutsou
Co-authored-by: Bence Parajdi
Co-authored-by: Bálint Soproni
Co-authored-by: Gergely Meszaros
Co-authored-by: Beatriz Navidad Vilches
Co-authored-by: Mátyás Aradi
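
Note on the DPP detection change: the commit message says support is now derived from the compiler-provided `__GFX<GENERATION>__` macros and that `ROCPRIM_DETAIL_USE_DPP` is always defined, to 0 when DPP is disabled. A minimal sketch of that pattern follows. The `SKETCH_*` macro names are hypothetical and are not rocPRIM's, and the choice of which generations enable DPP or DPP broadcast is an assumption made for illustration, not taken from the commit.

```cpp
// Hypothetical sketch: SKETCH_* names are illustrative, not rocPRIM macros.
// Assumption: the amdgpu target in clang predefines __GFX8__, __GFX9__, ...
// for the generation being compiled; none are defined in a host-only build.

// Always define the "use DPP" macro, so it can be tested with #if instead
// of #ifdef and is never left undefined when the feature is disabled.
#if defined(SKETCH_DISABLE_DPP)
    #define SKETCH_USE_DPP 0
#elif defined(__GFX8__) || defined(__GFX9__) || defined(__GFX10__) || defined(__GFX11__)
    #define SKETCH_USE_DPP 1 // assumed list of generations with DPP
#else
    #define SKETCH_USE_DPP 0
#endif

// A capability macro that names exactly what is needed, instead of a
// generic "is this a NAVI card" switch.
#if SKETCH_USE_DPP && (defined(__GFX8__) || defined(__GFX9__))
    #define SKETCH_HAS_DPP_BROADCAST 1 // assumed list of generations
#else
    #define SKETCH_HAS_DPP_BROADCAST 0
#endif

constexpr bool sketch_use_dpp()
{
    return SKETCH_USE_DPP != 0;
}

constexpr bool sketch_has_dpp_broadcast()
{
    return SKETCH_HAS_DPP_BROADCAST != 0;
}

// Broadcast support only makes sense when DPP itself is enabled.
static_assert(!sketch_has_dpp_broadcast() || sketch_use_dpp(),
              "DPP broadcast requires DPP");
```

Because both macros always expand to 0 or 1, callers can write `#if SKETCH_USE_DPP` without first checking whether the macro exists.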
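
Note on the `get_random_data` indexing fix: the commit message reports that indexing was 4 based although the type has 3 variables, which overflowed. The sketch below illustrates that class of bug under the assumption that the items are built from a flat buffer holding 3 values per item. `custom_type3`, `make_items` and `last_index` are hypothetical names, not the rocPRIM test helpers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a test type with three member variables.
struct custom_type3
{
    int x;
    int y;
    int z;
};

constexpr std::size_t fields_per_item = 3;

// Builds flat.size() / 3 items from a flat buffer of values.
std::vector<custom_type3> make_items(const std::vector<int>& flat)
{
    const std::size_t         n = flat.size() / fields_per_item;
    std::vector<custom_type3> items(n);
    for(std::size_t i = 0; i < n; ++i)
    {
        // The stride must match the number of fields. A stride of 4 here
        // would read past the end of the buffer for the later items.
        const std::size_t base = i * fields_per_item;
        items[i] = {flat[base], flat[base + 1], flat[base + 2]};
    }
    return items;
}

// Largest flat index touched when building n items (n > 0) with a given stride.
constexpr std::size_t last_index(std::size_t n, std::size_t stride)
{
    return (n - 1) * stride + (fields_per_item - 1);
}
```

For 4 items the buffer holds 12 values: a stride of 3 touches indices up to 11, while a stride of 4 would reach index 14, which is the kind of out-of-bounds read address sanitizer reports.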