Block Radix Sort improvements and Segmented Radix Sort tuning (#636)
* fix(device_radix_sort): add missing 'ROCPRIM_IF_CONSTEXPR'
* fix(warp_sort_stable): add missing includes
* fix(benchmark_device_segmented_radix_sort_pairs): fix really stupid compile issue
* feat(benchmark_utils): add benchmark support for 'rocprim::bfloat16'
* perf(device_radix_sort): replace 'match_any'-based counter with atomics
  Atomic-based counter has better performance.
* perf(device_radix_sort): directly use radix_rank for sort_and_scatter
  Performance improvement seems minimal, but perhaps this can serve as a starting point for more optimization.
* perf(device_segmented_radix_sort): use warp_sort_stable for single-warp sorts
* perf(device_segmented_radix_sort): improve medium sort with 8 bits per pass
* feat(block_radix_sort): add override for rank algorithm
* refactor(device_segmented_radix_sort): remove short radix bits from large segments
  This doesn't seem to improve anything for 8/8 bits. TODO: Check whether it has any effect for the other radix sizes (like 7/6), but it shouldn't really.
* perf(device_radix_sort,device_segmented_radix_sort): make 'sort_block' output striped values
  This can be done more efficiently internally in the block radix sort
* fix(device_segmented_radix_sort): modify segmented warp sort to accept block size
* perf(device_segmented_radix_sort): use radix sort in combined kernel for small segments
* perf(block_radix_sort): fuse scatter of final iteration if sorting to striped
  Improves the block radix sort performance when using the to_striped versions.
* feat(block_exchange): support scatter_to_warp_striped
  This is needed for block radix sort with rank match
* feat(block_radix_sort): fully support radix rank match
  TODO: documentation and tests
* refactor(device_segmented_radix_sort): remove short radix sort in large separate kernel
* fix(device_segmented_radix_sort): fix 'warp_sort' with only keys
* feat(autotune-search): added tool for config tuning using dual annealing
  This currently breaks normal autotune compilation for device segmented radix sort benchmarks. Setting the new CMake options 'BENCHMARK_TUNE_PARAM_NAMES' and 'BENCHMARK_TUNE_PARAMS' for this algorithm is required!
* feat(device/config_types.hpp): add support for gfx942 dynamic dispatch
* refactor(device_segmented_radix_sort): deprecation of short radix bits
* fix(benchmark_device_segmented_radix_sort_*): update tuning to handle new config space
* feat(scripts/autotune): add 'gfx942' target
* perf(config/device_segmented_radix_sort): add tuned configs for gfx942
* feat(block_radix_sort): allow using block radix rank match algorithm for inputs in blocked layout and use this by default when block size is a multiple of device warp size
* perf(block_radix_sort): select higher radix bits per pass when using 'block_radix_rank_algorithm::match'
* perf(detail/block_radix_rank_match.hpp): only pad number of warps when it does not impact occupancy
  This only takes occupancy limited by LDS into consideration. Register pressure is ignored.
* perf(detail/block_radix_rank_match.hpp): remove read-after-write dependency for first iteration in rank loop
* refactor(detail/block_radix_rank_match.hpp): deduplicate lds usage emulation
* perf(block_radix_sort.hpp): use warp-shuffle-based blocked to striped sub algorithm
* perf: implemented generalized block-level configs that maximize occupancy
* perf(block_radix_sort): relax block sync to wavefront sync in specific case of between key and value blocked to warp striped exchange
* perf(block_radix_sort): do internal key encoding before potential blocked to warp striped exchange to improve latency hiding
* style: formatting
* fix(block_radix_sort): fix unused parameter 'storage' warning
* fix(block_radix_sort.hpp): remove unused and broken include
* fix(device_radix_sort): fix incorrect data layout in block load for internal block radix sort
* docs(block_exchange): document 'scatter_to_warp_striped' api
* docs: fix typo
* docs(block_radix_sort): document warp_striped_to_striped and variants
* feat(block_exchange): added 'block_exchange_padding_mode' type parameter to 'block_exchange'
* refactor: generalize padding hints
* perf(detail/device_radix_sort): unroll init loop
* perf(config/device_radix_sort_block_sort): update tuning
* perf(config/device_radix_sort): tuned device_radix_sort on gfx942
* docs(changelog): update changelog

---------

Co-authored-by: Robin Voetter <[email protected]>
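Several entries above refer to the blocked, striped and warp-striped data arrangements. As a reading aid, here is a minimal host-side C++ sketch of how those arrangements are conventionally defined in block-level primitives: it maps a (thread, item) pair to a flat index in each arrangement. The function names and parameters are hypothetical and chosen for illustration only; this is not rocPRIM's API and not code from this commit.

```cpp
#include <cstddef>

// Illustrative helpers only (hypothetical names, not part of rocPRIM).
// Each returns the flat index of item `item` owned by thread `tid`.

// Blocked: every thread owns one contiguous run of items.
inline std::size_t blocked_index(std::size_t tid,
                                 std::size_t item,
                                 std::size_t items_per_thread)
{
    return tid * items_per_thread + item;
}

// Striped: consecutive threads own consecutive elements,
// and a thread's items are block_size elements apart.
inline std::size_t striped_index(std::size_t tid,
                                 std::size_t item,
                                 std::size_t block_size)
{
    return item * block_size + tid;
}

// Warp-striped: each warp owns a contiguous tile of
// warp_size * items_per_thread elements, striped inside that tile.
inline std::size_t warp_striped_index(std::size_t tid,
                                      std::size_t item,
                                      std::size_t items_per_thread,
                                      std::size_t warp_size)
{
    const std::size_t warp_id = tid / warp_size;
    const std::size_t lane    = tid % warp_size;
    return warp_id * warp_size * items_per_thread + item * warp_size + lane;
}
```

For example, with a block of 8 threads, a warp size of 4 and 2 items per thread, item 1 of thread 2 sits at index 5 in the blocked arrangement, 10 in the striped arrangement and 6 in the warp-striped arrangement.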
Showing 28 changed files with 3,926 additions and 1,220 deletions.