Block Radix Sort improvements and Segmented Radix Sort tuning (#636)
* fix(device_radix_sort): add missing 'ROCPRIM_IF_CONSTEXPR'

* fix(warp_sort_stable): add missing includes

* fix(benchmark_device_segmented_radix_sort_pairs): fix really stupid compile issue

* feat(benchmark_utils): add benchmark support for 'rocprim::bfloat16'

* perf(device_radix_sort): replace 'match_any'-based counter with atomics

Atomic-based counter has better performance.

* perf(device_radix_sort): directly use radix_rank for sort_and_scatter

Performance improvement seems minimal, but perhaps this can serve
as a starting point for more optimization.

* perf(device_segmented_radix_sort): use warp_sort_stable for single-warp sorts

* perf(device_segmented_radix_sort): improve medium sort with 8 bits per pass

* feat(block_radix_sort): add override for rank algorithm

* refactor(device_segmented_radix_sort): remove short radix bits from large segments

This doesn't seem to improve anything for 8/8 bits.
TODO: Check whether it has any effect for the other
radix sizes (like 7/6), but it shouldn't really.

* perf(device_radix_sort,device_segmented_radix_sort): make 'sort_block' output striped values

This can be done more efficiently internally in the block radix sort

* fix(device_segmented_radix_sort): modify segmented warp sort to accept block size

* perf(device_segmented_radix_sort): use radix sort in combined kernel for small segments

* perf(block_radix_sort): fuse scatter of final iteration if sorting to striped

Improves the block radix sort performance when using the to_striped versions.

* feat(block_exchange): support scatter_to_warp_striped

This is needed for block radix sort with rank match

* feat(block_radix_sort): fully support radix rank match

TODO: documentation and tests

* refactor(device_segmented_radix_sort): remove short radix sort in large separate kernel

* fix(device_segmented_radix_sort): fix 'warp_sort' with only keys

* feat(autotune-search): added tool for config tuning using dual annealing

This currently breaks normal autotune compilation for device segmented radix sort benchmarks. Setting the new CMake options 'BENCHMARK_TUNE_PARAM_NAMES' and 'BENCHMARK_TUNE_PARAMS' for this algorithm is required!

* feat(device/config_types.hpp): add support for gfx942 dynamic dispatch

* refactor(device_segmented_radix_sort): deprecation of short radix bits

* fix(benchmark_device_segmented_radix_sort_*): update tuning to handle new config space

* feat(scripts/autotune): add 'gfx942' target

* perf(config/device_segmented_radix_sort): add tuned configs for gfx942

* feat(block_radix_sort): allow using block radix rank match algorithm for inputs in blocked layout and use this by default when block size is a multiple of device warp size

* perf(block_radix_sort): select higher radix bits per pass when using 'block_radix_rank_algorithm::match'

* perf(detail/block_radix_rank_match.hpp): only pad number of warps when it does not impact occupancy

This only takes occupancy limited by LDS into consideration. Register pressure is ignored.

* perf(detail/block_radix_rank_match.hpp): remove read-after-write dependency for first iteration in rank loop

* refactor(detail/block_radix_rank_match.hpp): deduplicate lds usage emulation

* perf(block_radix_sort.hpp): use warp-shuffle-based blocked to striped sub algorithm

* perf: implemented generalized block-level configs that maximize occupancy

* perf(block_radix_sort): relax block sync to wavefront sync between the key and value blocked-to-warp-striped exchanges

* perf(block_radix_sort): do internal key encoding before potential blocked to warp striped exchange to improve latency hiding

* style: formatting

* fix(block_radix_sort): fix unused parameter 'storage' warning

* fix(block_radix_sort.hpp): remove unused and broken include

* fix(device_radix_sort): fix incorrect data layout in block load for internal block radix sort

* docs(block_exchange): document 'scatter_to_warp_striped' api

* docs: fix typo

* docs(block_radix_sort): document warp_striped_to_striped and variants

* feat(block_exchange): added 'block_exchange_padding_mode' type parameter to 'block_exchange'

* refactor: generalize padding hints

* perf(detail/device_radix_sort): unroll init loop

* perf(config/device_radix_sort_block_sort): update tuning

* perf(config/device_radix_sort): tuned device_radix_sort on gfx942

* docs(changelog): update changelog
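
Several of the items above (the striped `sort_block` output, the fused final scatter, the warp-shuffle-based blocked-to-striped sub-algorithm) revolve around two register layouts. As a rough host-side illustration only, and not rocPRIM code: in a blocked layout thread `t` holds `items_per_thread` consecutive elements of the tile, while in a striped layout its elements are `block_size` apart, so consecutive threads touch consecutive addresses. The helper names `blocked_index`, `striped_index` and `blocked_to_striped` are hypothetical and exist only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration; these are not rocPRIM APIs.

// Blocked layout: item i of thread t is tile element t * items_per_thread + i.
constexpr std::size_t
    blocked_index(std::size_t thread, std::size_t item, std::size_t items_per_thread)
{
    return thread * items_per_thread + item;
}

// Striped layout: item i of thread t is tile element i * block_size + t.
constexpr std::size_t
    striped_index(std::size_t thread, std::size_t item, std::size_t block_size)
{
    return item * block_size + thread;
}

// Takes a tile stored in blocked order (thread 0's items, then thread 1's, ...)
// and returns the per-thread registers after a blocked-to-striped exchange,
// again flattened as thread 0's items, then thread 1's, and so on.
std::vector<int> blocked_to_striped(const std::vector<int>& tile,
                                    std::size_t             block_size,
                                    std::size_t             items_per_thread)
{
    assert(tile.size() == block_size * items_per_thread);
    std::vector<int> regs(tile.size());
    for(std::size_t t = 0; t < block_size; ++t)
    {
        for(std::size_t i = 0; i < items_per_thread; ++i)
        {
            // Register slot (t, i) receives the element it owns in striped layout.
            regs[blocked_index(t, i, items_per_thread)] = tile[striped_index(t, i, block_size)];
        }
    }
    return regs;
}
```

With `block_size = 4` and two items per thread, thread 1 ends up holding tile elements 1 and 5 instead of 2 and 3.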

---------

Co-authored-by: Robin Voetter <[email protected]>
Naraenda and Snektron authored Nov 18, 2024
1 parent c6b1468 commit 5182246
Showing 28 changed files with 3,926 additions and 1,220 deletions.
7 changes: 6 additions & 1 deletion CHANGELOG.md
@@ -2,23 +2,28 @@

Full documentation for rocPRIM is available at [https://rocm.docs.amd.com/projects/rocPRIM/en/latest/](https://rocm.docs.amd.com/projects/rocPRIM/en/latest/).


## (Unreleased) rocPRIM 3.4.0 for ROCm 6.4.0

### Added

* Added extended tests to `rtest.py`. These are extra tests that did not fit the criteria of smoke and regression tests, and they take much longer to run than smoke and regression tests.
  * Use `python rtest.py [--emulation|-e|--test|-t]=extended` to run these tests.
* Added regression tests to `rtest.py`. Regression tests are a subset of tests that caused hardware problems for past emulation environments.
  * Use `python rtest.py [--emulation|-e|--test|-t]=regression` to run these tests.
* Added the parallel `find_first_of` device function with autotuned configurations. This function is similar to `std::find_first_of`: it searches for the first occurrence of any of the provided elements.
* Added the `--emulation` option to `rtest.py`.
  * Unit tests can be run with `[--emulation|-e|--test|-t]=<test_name>`.
* Added tuned configurations for segmented radix sort for gfx942 to improve performance on this architecture.

### Changed

* Changed the subset of tests that are run for smoke tests so that the smoke tests complete faster and never exceed 2 GB of VRAM usage. Use `python rtest.py [--emulation|-e|--test|-t]=smoke` to run these tests.
* The `rtest.py` options have changed. `rtest.py` must now be run with either `--test|-t` or `--emulation|-e`, but not both.
* Changed the internal algorithm of block radix sort to use rank match to improve performance of various radix sort related algorithms.
* Disabled padding in various cases where higher occupancy resulted in better performance despite more bank conflicts.

### Resolved issues

* Fixed an issue where `rmake.py` would generate incorrect CMake commands in a Linux environment.
* Fixed an issue where `rocprim::partial_sort_copy` would yield a compile error if the input iterator is const.
* Fixed incorrect type traits for 128-bit signed and unsigned integers.
9 changes: 9 additions & 0 deletions benchmark/CMakeLists.txt
@@ -24,6 +24,9 @@ option(BENCHMARK_CONFIG_TUNING "Benchmark device-level functions using various c
include(../cmake/ConfigAutotune.cmake)
include(ConfigAutotuneSettings.cmake)

+option(BENCHMARK_TUNE_PARAM_NAMES "Tuning parameter names" "")
+option(BENCHMARK_TUNE_PARAMS "Tuning parameters" "")

if(BENCHMARK_CONFIG_TUNING)
add_custom_target("benchmark_config_tuning")
endif()
@@ -35,6 +38,12 @@ function(add_rocprim_benchmark BENCHMARK_SOURCE)
if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/${BENCHMARK_TARGET}.parallel.cpp.in")
message(STATUS "found ${BENCHMARK_TARGET}.parallel.cpp.in file, compiling in parallel.")
read_config_autotune_settings(${BENCHMARK_TARGET} list_across_names list_across output_pattern_suffix)

+if(BENCHMARK_TUNE_PARAM_NAMES AND BENCHMARK_TUNE_PARAMS)
+set(list_across_names "${BENCHMARK_TUNE_PARAM_NAMES}")
+set(list_across "${BENCHMARK_TUNE_PARAMS}")
+endif()

#make sure that variables are not empty, i.e. there actually is an entry for that benchmark in benchmark/ConfigAutotuneSettings.cmake
if(list_across_names)
add_executable(${BENCHMARK_TARGET} ${BENCHMARK_SOURCE})
12 changes: 6 additions & 6 deletions benchmark/ConfigAutotuneSettings.cmake
@@ -83,16 +83,16 @@ binary_search upper_bound lower_bound;${TUNING_TYPES};${LIMITED_TUNING_TYPES};64
set(output_pattern_suffix "@SubAlgorithm@_@ValueType@_@OutputType@_@BlockSize@_@ItemsPerThread@" PARENT_SCOPE)
elseif(file STREQUAL "benchmark_device_segmented_radix_sort_keys")
set(list_across_names "\
-KeyType;BlockSize;ItemsPerThread;PartitionAllowed" PARENT_SCOPE)
-set(list_across "${TUNING_TYPES};128 256;4 8 16;false" PARENT_SCOPE)
+KeyType;LongBits;BlockSize;ItemsPerThread;WarpSmallLWS;WarpSmallIPT;WarpSmallBS;WarpPartition;WarpMediumLWS;WarpMediumIPT;WarpMediumBS" PARENT_SCOPE)
+set(list_across "${TUNING_TYPES};8;256;4 8 16;8;4;256;64;16;8;256" PARENT_SCOPE)
set(output_pattern_suffix "\
-@KeyType@_@BlockSize@_@ItemsPerThread@_@PartitionAllowed@" PARENT_SCOPE)
+@KeyType@_@LongBits@_@BlockSize@_@ItemsPerThread@_@WarpSmallLWS@_@WarpSmallIPT@_@WarpSmallBS@_@WarpPartition@_@WarpMediumLWS@_@WarpMediumIPT@_@WarpMediumBS@" PARENT_SCOPE)
elseif(file STREQUAL "benchmark_device_segmented_radix_sort_pairs")
set(list_across_names "\
-KeyType;ValueType;BlockSize;ItemsPerThread;PartitionAllowed" PARENT_SCOPE)
-set(list_across "${TUNING_TYPES};int8_t;64;4 8 16;true false" PARENT_SCOPE)
+KeyType;ValueType;LongBits;BlockSize;ItemsPerThread;WarpSmallLWS;WarpSmallIPT;WarpSmallBS;WarpPartition;WarpMediumLWS;WarpMediumIPT;WarpMediumBS" PARENT_SCOPE)
+set(list_across "${TUNING_TYPES};int8_t;8;256;4 8 16;8;4;256;64;16;8;256" PARENT_SCOPE)
set(output_pattern_suffix "\
-@KeyType@_@ValueType@_@BlockSize@_@ItemsPerThread@_@PartitionAllowed@" PARENT_SCOPE)
+@KeyType@_@ValueType@_@LongBits@_@BlockSize@_@ItemsPerThread@_@WarpSmallLWS@_@WarpSmallIPT@_@WarpSmallBS@_@WarpPartition@_@WarpMediumLWS@_@WarpMediumIPT@_@WarpMediumBS@" PARENT_SCOPE)
elseif(file STREQUAL "benchmark_device_transform")
set(list_across_names "\
DataType;BlockSize;" PARENT_SCOPE)
2 changes: 1 addition & 1 deletion benchmark/benchmark_device_segmented_radix_sort_keys.cpp
@@ -291,7 +291,7 @@ int main(int argc, char* argv[])
config_autotune_register::register_benchmark_subset(benchmarks,
parallel_instance,
parallel_instances,
-size,
+min_size,
seed,
stream);
#else
@@ -1,6 +1,6 @@
// MIT License
//
-// Copyright (c) 2022-2023 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
@@ -27,8 +27,20 @@

namespace
{
-auto benchmarks = config_autotune_register::create_bulk(device_segmented_radix_sort_benchmark_generator<@BlockSize@,
-@ItemsPerThread@,
-@KeyType@,
-@PartitionAllowed@>::create);
+auto benchmarks = config_autotune_register::create_bulk(
+device_segmented_radix_sort_benchmark_generator<
+@LongBits@,
+0,
+@BlockSize@,
+@ItemsPerThread@,
+@WarpSmallLWS@,
+@WarpSmallIPT@,
+@WarpSmallBS@,
+@WarpPartition@,
+@WarpMediumLWS@,
+@WarpMediumIPT@,
+@WarpMediumBS@,
+@KeyType@,
+true
+>::create);
} // namespace
110 changes: 28 additions & 82 deletions benchmark/benchmark_device_segmented_radix_sort_keys.parallel.hpp
@@ -261,92 +261,38 @@ struct device_segmented_radix_sort_benchmark : public config_autotune_interface

template<typename Tp, template<Tp> class T, bool enable, Tp... Idx>
struct decider;
-template<unsigned int BlockSize, unsigned int ItemsPerThread, typename Key, bool PartitionAllowed>
+
+template<unsigned int LongBits,
+unsigned int ShortBits,
+unsigned int BlockSize,
+unsigned int ItemsPerThread,
+unsigned int WarpSmallLWS,
+unsigned int WarpSmallIPT,
+unsigned int WarpSmallBS,
+unsigned int WarpPartition,
+unsigned int WarpMediumLWS,
+unsigned int WarpMediumIPT,
+unsigned int WarpMediumBS,
+typename Key,
+bool UnpartitionWarpAllowed = true>
struct device_segmented_radix_sort_benchmark_generator
{
-template<unsigned int LongBits>
-struct create_lrb
-{
-template<unsigned int ShortBits>
-struct create_srb
-{
-template<bool EnableUnpartitionedWarpSort>
-struct create_euws
-{
-template<unsigned int LogicalWarpSizeSmall>
-struct create_lwss
-{
-template<unsigned int PartitioningThreshold>
-struct create_pt
-{
-void operator()(
-std::vector<std::unique_ptr<config_autotune_interface>>& storage)
-{
-storage.emplace_back(
-std::make_unique<device_segmented_radix_sort_benchmark<
-Key,
-rocprim::segmented_radix_sort_config<
-LongBits,
-ShortBits,
-rocprim::kernel_config<BlockSize, ItemsPerThread>,
-rocprim::WarpSortConfig<LogicalWarpSizeSmall / 2,
-ItemsPerThread / 2,
-BlockSize,
-PartitioningThreshold,
-LogicalWarpSizeSmall,
-ItemsPerThread,
-BlockSize>,
-EnableUnpartitionedWarpSort>>>());
-}
-};
-
-void
-operator()(std::vector<std::unique_ptr<config_autotune_interface>>& storage)
-{
-static_for_each<std::integer_sequence<unsigned int, 5>, create_pt>(storage);
-}
-};
-
-void operator()(std::vector<std::unique_ptr<config_autotune_interface>>& storage)
-{
-if(PartitionAllowed)
-{
-
-static_for_each<std::integer_sequence<unsigned int, 8, 16, 32>,
-create_lwss>(storage);
-}
-else
-{
-storage.emplace_back(
-std::make_unique<device_segmented_radix_sort_benchmark<
-Key,
-rocprim::segmented_radix_sort_config<
-LongBits,
-ShortBits,
-rocprim::kernel_config<BlockSize, ItemsPerThread>,
-rocprim::DisabledWarpSortConfig,
-EnableUnpartitionedWarpSort>>>());
-}
-}
-};
-
-void operator()(std::vector<std::unique_ptr<config_autotune_interface>>& storage)
-{
-decider<bool, create_euws, 1u << ShortBits <= BlockSize, true>::do_the_thing(
-storage);
-}
-};
-
-void operator()(std::vector<std::unique_ptr<config_autotune_interface>>& storage)
-{
-decider<unsigned int, create_srb, 1u << LongBits <= BlockSize, 3, 5>::do_the_thing(
-storage);
-}
-};
-
static void create(std::vector<std::unique_ptr<config_autotune_interface>>& storage)
{
-static_for_each<std::integer_sequence<unsigned int, 4, 5>, create_lrb>(storage);
+storage.emplace_back(std::make_unique<device_segmented_radix_sort_benchmark<
+Key,
+rocprim::segmented_radix_sort_config<
+LongBits,
+ShortBits,
+rocprim::kernel_config<BlockSize, ItemsPerThread>,
+rocprim::WarpSortConfig<WarpSmallLWS,
+WarpSmallIPT,
+WarpSmallBS,
+WarpPartition,
+WarpMediumLWS,
+WarpMediumIPT,
+WarpMediumBS>,
+UnpartitionWarpAllowed>>>());
}
};
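
The refactored generator registers exactly one benchmark per template instantiation instead of expanding nested `static_for_each` sweeps; the sweep itself appears to move to the values listed in `ConfigAutotuneSettings.cmake`. A stripped-down sketch of that pattern, using hypothetical stand-in types (`autotune_interface_sketch`, `benchmark_sketch`, `generator_sketch`) rather than the real rocPRIM benchmark classes, might look like this:

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for config_autotune_interface; not the real class.
struct autotune_interface_sketch
{
    virtual ~autotune_interface_sketch()          = default;
    virtual unsigned int block_size() const       = 0;
    virtual unsigned int items_per_thread() const = 0;
};

// Hypothetical stand-in for a benchmark that is fully described by its
// compile-time tuning parameters.
template<unsigned int BlockSize, unsigned int ItemsPerThread>
struct benchmark_sketch : autotune_interface_sketch
{
    unsigned int block_size() const override
    {
        return BlockSize;
    }
    unsigned int items_per_thread() const override
    {
        return ItemsPerThread;
    }
};

// One instantiation registers one benchmark. Covering a range of parameter
// values means instantiating the generator once per combination, which an
// external tool (CMake in the commit above) is assumed to take care of.
template<unsigned int BlockSize, unsigned int ItemsPerThread>
struct generator_sketch
{
    static void create(std::vector<std::unique_ptr<autotune_interface_sketch>>& storage)
    {
        storage.emplace_back(std::make_unique<benchmark_sketch<BlockSize, ItemsPerThread>>());
    }
};
```

Calling `generator_sketch<256, 8>::create(storage)` appends a single entry, so the number of registered benchmarks equals the number of instantiated parameter combinations.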

@@ -1,6 +1,6 @@
// MIT License
//
-// Copyright (c) 2022-2023 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
@@ -27,9 +27,21 @@

namespace
{
-auto benchmarks = config_autotune_register::create_bulk(device_segmented_radix_sort_benchmark_generator<@BlockSize@,
-@ItemsPerThread@,
-@KeyType@,
-@ValueType@,
-@PartitionAllowed@>::create);
+auto benchmarks = config_autotune_register::create_bulk(
+device_segmented_radix_sort_benchmark_generator<
+@LongBits@,
+8,
+@BlockSize@,
+@ItemsPerThread@,
+@WarpSmallLWS@,
+@WarpSmallIPT@,
+@WarpSmallBS@,
+@WarpPartition@,
+@WarpMediumLWS@,
+@WarpMediumIPT@,
+@WarpMediumBS@,
+@KeyType@,
+@ValueType@,
+true
+>::create);
} // namespace