Block Radix Sort improvements and Segmented Radix Sort tuning (#636)

* fix(device_radix_sort): add missing 'ROCPRIM_IF_CONSTEXPR'
* fix(warp_sort_stable): add missing includes
* fix(benchmark_device_segmented_radix_sort_pairs): fix really stupid compile issue
* feat(benchmark_utils): add benchmark support for 'rocprim::bfloat16'
* perf(device_radix_sort): replace 'match_any'-based counter with atomics
  Atomic-based counter has better performance.
* perf(device_radix_sort): directly use radix_rank for sort_and_scatter
  Performance improvement seems minimal, but perhaps this can serve as a starting point for more optimization.
* perf(device_segmented_radix_sort): use warp_sort_stable for single-warp sorts
* perf(device_segmented_radix_sort): improve medium sort with 8 bits per pass
* feat(block_radix_sort): add override for rank algorithm
* refactor(device_segmented_radix_sort): remove short radix bits from large segments
  This doesn't seem to improve anything for 8/8 bits. TODO: Check whether it has any effect for the other radix sizes (like 7/6), but it shouldn't really.
* perf(device_radix_sort,device_segmented_radix_sort): make 'sort_block' output striped values
  This can be done more efficiently internally in the block radix sort
* fix(device_segmented_radix_sort): modify segmented warp sort to accept block size
* perf(device_segmented_radix_sort): use radix sort in combined kernel for small segments
* perf(block_radix_sort): fuse scatter of final iteration if sorting to striped
  Improves the block radix sort performance when using the to_striped versions.
* feat(block_exchange): support scatter_to_warp_striped
  This is needed for block radix sort with rank match
* feat(block_radix_sort): fully support radix rank match
  TODO: documentation and tests
* refactor(device_segmented_radix_sort): remove short radix sort in large separate kernel
* fix(device_segmented_radix_sort): fix 'warp_sort' with only keys
* feat(autotune-search): added tool for config tuning using dual annealing
  This currently breaks normal autotune compilation for device segmented radix sort benchmarks. Setting the new CMake options 'BENCHMARK_TUNE_PARAM_NAMES' and 'BENCHMARK_TUNE_PARAMS' for this algorithm is required!
* feat(device/config_types.hpp): add support for gfx942 dynamic dispatch
* refactor(device_segmented_radix_sort): deprecation of short radix bits
* fix(benchmark_device_segmented_radix_sort_*): update tuning to handle new config space
* feat(scripts/autotune): add 'gfx942' target
* perf(config/device_segmented_radix_sort): add tuned configs for gfx942
* feat(block_radix_sort): allow using block radix rank match algorithm for inputs in blocked layout and use this by default when block size is a multiple of device warp size
* perf(block_radix_sort): select higher radix bits per pass when using 'block_radix_rank_algorithm::match'
* perf(detail/block_radix_rank_match.hpp): only pad number of warps when it does not impact occupancy
  This only takes occupancy limited by LDS into consideration. Register pressure is ignored.
* perf(detail/block_radix_rank_match.hpp): remove read-after-write dependency for first iteration in rank loop
* refactor(detail/block_radix_rank_match.hpp): deduplicate lds usage emulation
* perf(block_radix_sort.hpp): use warp-shuffle-based blocked to striped sub algorithm
* perf: implemented generalized block-level configs that maximize occupancy
* perf(block_radix_sort): relax block sync to wavefront sync in specific case of between key and value blocked to warp striped exchange
* perf(block_radix_sort): do internal key encoding before potential blocked to warp striped exchange to improve latency hiding
* style: formatting
* fix(block_radix_sort): fix unused parameter 'storage' warning
* fix(block_radix_sort.hpp): remove unused and broken include
* fix(device_radix_sort): fix incorrect data layout in block load for internal block radix sort
* docs(block_exchange): document 'scatter_to_warp_striped' api
* docs: fix typo
* docs(block_radix_sort): document warp_striped_to_striped and variants
* feat(block_exchange): added 'block_exchange_padding_mode' type parameter to 'block_exchange'
* refactor: generalize padding hints
* perf(detail/device_radix_sort): unroll init loop
* perf(config/device_radix_sort_block_sort): update tuning
* perf(config/device_radix_sort): tuned device_radix_sort on gfx942
* docs(changelog): update changelog

---------

Co-authored-by: Robin Voetter <[email protected]>
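
Several of the commits above refer to "blocked" and "striped" data layouts. The following host-side sketch illustrates the difference between the two arrangements; it is not rocPRIM code. The names `block_size`, `items_per_thread`, `registers`, and `blocked_to_striped` are hypothetical, and the `shared` array only stands in for the on-chip memory a real block exchange would use.

```cpp
#include <array>
#include <cstddef>

// Hypothetical sizes, chosen only to keep the illustration small.
constexpr std::size_t block_size       = 4; // threads per block
constexpr std::size_t items_per_thread = 2;
constexpr std::size_t total            = block_size * items_per_thread;

// registers[t][i] is item i held by thread t.
using registers = std::array<std::array<int, items_per_thread>, block_size>;

// Blocked layout: thread t owns the consecutive items
// t * items_per_thread .. t * items_per_thread + items_per_thread - 1.
constexpr std::size_t blocked_index(std::size_t thread, std::size_t item)
{
    return thread * items_per_thread + item;
}

// Striped layout: thread t owns items t, t + block_size, t + 2 * block_size, ...
constexpr std::size_t striped_index(std::size_t thread, std::size_t item)
{
    return item * block_size + thread;
}

// Sequential emulation of a blocked-to-striped exchange: every "thread"
// writes its items to a staging buffer at the blocked position and reads
// them back at the striped position.
inline registers blocked_to_striped(const registers& blocked)
{
    std::array<int, total> shared{}; // stand-in for shared memory
    for(std::size_t t = 0; t < block_size; ++t)
        for(std::size_t i = 0; i < items_per_thread; ++i)
            shared[blocked_index(t, i)] = blocked[t][i];

    registers striped{};
    for(std::size_t t = 0; t < block_size; ++t)
        for(std::size_t i = 0; i < items_per_thread; ++i)
            striped[t][i] = shared[striped_index(t, i)];
    return striped;
}
```

With the sequence 0..7 in blocked layout, thread 0 holds items 0 and 1; after the exchange it holds items 0 and 4. In a striped arrangement, neighbouring threads hold neighbouring elements of the sequence.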
ROCm · Nov 18, 2024 · 5182246
1 parent c6b1468
commit 5182246
Showing 28 changed files with 3,926 additions and 1,220 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,23 +2,28 @@
 
 Full documentation for rocPRIM is available at [https://rocm.docs.amd.com/projects/rocPRIM/en/latest/](https://rocm.docs.amd.com/projects/rocPRIM/en/latest/).
 
-
 ## (Unreleased) rocPRIM 3.4.0 for ROCm 6.4.0
 
 ### Added
+
 * Added extended tests to `rtest.py`. These tests are extra tests that did not fit the criteria of smoke and regression tests. These tests will take much longer to run relative to smoke and regression tests.
  * Use `python rtest.py [--emulation|-e|--test|-t]=extended` to run these tests.
 * Added regression tests to `rtest.py`. Regression tests are a subset of tests that caused hardware problems for past emulation environments.
   * Can be run with `python rtest.py [--emulation|-e|--test|-t]=regression`
 * Added the parallel `find_first_of` device function with autotuned configurations, this function is similar to `std::find_first_of`, it searches for the first occurrence of any of the provided elements.
 * Added `--emulation` option added for `rtest.py`
   * Unit tests can be run with `[--emulation|-e|--test|-t]=<test_name>`
+* Added tuned configurations for segmented radix sort for gfx942 to improve performance on this architecture.
 
 ### Changed
+
 * Changed the subset of tests that are run for smoke tests such that the smoke test will complete with faster run-time and to never exceed 2GB of vram usage. Use `python rtest.py [--emulation|-e|--test|-t]=smoke` to run these tests.
 * The `rtest.py` options have changed. `rtest.py` is now run with at least either `--test|-t` or `--emulation|-e`, but not both options.
+* Changed the internal algorithm of block radix sort to use rank match to improve performance of various radix sort related algorithms.
+* Disabled padding in various cases where higher occupancy resulted in better performance despite more bank conflicts.
 
 ### Resolved issues
+
 * Fixed an issue where `rmake.py` would generate wrong CMAKE commands while using Linux environment
 * Fixed an issue where `rocprim::partial_sort_copy` would yield a compile error if the input iterator is const.
 * Fixed incorrect 128-bit signed and unsigned integers type traits.
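
The changelog entry about block radix sort refers to the ranking step of a radix sort pass. As a rough illustration of what a ranking step computes, here is a sequential, host-side sketch. It is not the rocPRIM rank match implementation, which works across the threads of a block; `extract_digit` and `rank_keys` are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Extracts `radix_bits` bits of `key`, starting at bit `begin_bit`.
inline std::uint32_t extract_digit(std::uint32_t key, unsigned begin_bit, unsigned radix_bits)
{
    assert(radix_bits > 0 && radix_bits < 32 && begin_bit < 32);
    return (key >> begin_bit) & ((1u << radix_bits) - 1u);
}

// Returns, for every key, the position it takes after a stable sort by the
// selected digit. Keys that share a digit keep their input order.
inline std::vector<std::size_t> rank_keys(const std::vector<std::uint32_t>& keys,
                                          unsigned                          begin_bit,
                                          unsigned                          radix_bits)
{
    const std::size_t radix_size = std::size_t{1} << radix_bits;

    // Step 1: count how many keys fall into each digit bucket.
    std::vector<std::size_t> counts(radix_size, 0);
    for(const std::uint32_t key : keys)
        ++counts[extract_digit(key, begin_bit, radix_bits)];

    // Step 2: exclusive prefix sum gives the first output slot of each bucket.
    std::vector<std::size_t> next_slot(radix_size, 0);
    std::size_t              running = 0;
    for(std::size_t d = 0; d < radix_size; ++d)
    {
        next_slot[d] = running;
        running += counts[d];
    }

    // Step 3: hand out consecutive slots within each bucket in input order.
    std::vector<std::size_t> ranks(keys.size());
    for(std::size_t i = 0; i < keys.size(); ++i)
        ranks[i] = next_slot[extract_digit(keys[i], begin_bit, radix_bits)]++;
    return ranks;
}
```

For the keys `{3, 1, 2, 1}` with a 2-bit digit starting at bit 0, the ranks are `{3, 0, 2, 1}`: both keys with digit 1 come first, in their original order.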

diff --git a/benchmark/CMakeLists.txt b/benchmark/CMakeLists.txt
@@ -24,6 +24,9 @@ option(BENCHMARK_CONFIG_TUNING "Benchmark device-level functions using various c
 include(../cmake/ConfigAutotune.cmake)
 include(ConfigAutotuneSettings.cmake)
 
+option(BENCHMARK_TUNE_PARAM_NAMES "Tuning parameter names" "")
+option(BENCHMARK_TUNE_PARAMS "Tuning parameters" "")
+
 if(BENCHMARK_CONFIG_TUNING)
   add_custom_target("benchmark_config_tuning")
 endif()
@@ -35,6 +38,12 @@ function(add_rocprim_benchmark BENCHMARK_SOURCE)
     if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/${BENCHMARK_TARGET}.parallel.cpp.in")
       message(STATUS "found ${BENCHMARK_TARGET}.parallel.cpp.in file, compiling in parallel.")
       read_config_autotune_settings(${BENCHMARK_TARGET} list_across_names list_across output_pattern_suffix)
+
+      if(BENCHMARK_TUNE_PARAM_NAMES AND BENCHMARK_TUNE_PARAMS)
+        set(list_across_names "${BENCHMARK_TUNE_PARAM_NAMES}")
+        set(list_across "${BENCHMARK_TUNE_PARAMS}")
+      endif()
+
       #make sure that variables are not empty, i.e. there actually is an entry for that benchmark in benchmark/ConfigAutotuneSettings.cmake
       if(list_across_names)
         add_executable(${BENCHMARK_TARGET} ${BENCHMARK_SOURCE})

diff --git a/benchmark/ConfigAutotuneSettings.cmake b/benchmark/ConfigAutotuneSettings.cmake
@@ -83,16 +83,16 @@ binary_search upper_bound lower_bound;${TUNING_TYPES};${LIMITED_TUNING_TYPES};64
     set(output_pattern_suffix "@SubAlgorithm@_@ValueType@_@OutputType@_@BlockSize@_@ItemsPerThread@" PARENT_SCOPE)
   elseif(file STREQUAL "benchmark_device_segmented_radix_sort_keys")
     set(list_across_names "\
-KeyType;BlockSize;ItemsPerThread;PartitionAllowed" PARENT_SCOPE)
-    set(list_across "${TUNING_TYPES};128 256;4 8 16;false" PARENT_SCOPE)
+KeyType;LongBits;BlockSize;ItemsPerThread;WarpSmallLWS;WarpSmallIPT;WarpSmallBS;WarpPartition;WarpMediumLWS;WarpMediumIPT;WarpMediumBS" PARENT_SCOPE)
+    set(list_across "${TUNING_TYPES};8;256;4 8 16;8;4;256;64;16;8;256" PARENT_SCOPE)
     set(output_pattern_suffix "\
-@KeyType@_@BlockSize@_@ItemsPerThread@_@PartitionAllowed@" PARENT_SCOPE)
+@KeyType@_@LongBits@_@BlockSize@_@ItemsPerThread@_@WarpSmallLWS@_@WarpSmallIPT@_@WarpSmallBS@_@WarpPartition@_@WarpMediumLWS@_@WarpMediumIPT@_@WarpMediumBS@" PARENT_SCOPE)
   elseif(file STREQUAL "benchmark_device_segmented_radix_sort_pairs")
     set(list_across_names "\
-KeyType;ValueType;BlockSize;ItemsPerThread;PartitionAllowed" PARENT_SCOPE)
-    set(list_across "${TUNING_TYPES};int8_t;64;4 8 16;true false" PARENT_SCOPE)
+KeyType;ValueType;LongBits;BlockSize;ItemsPerThread;WarpSmallLWS;WarpSmallIPT;WarpSmallBS;WarpPartition;WarpMediumLWS;WarpMediumIPT;WarpMediumBS" PARENT_SCOPE)
+    set(list_across "${TUNING_TYPES};int8_t;8;256;4 8 16;8;4;256;64;16;8;256" PARENT_SCOPE)
     set(output_pattern_suffix "\
-@KeyType@_@ValueType@_@BlockSize@_@ItemsPerThread@_@PartitionAllowed@" PARENT_SCOPE)
+@KeyType@_@ValueType@_@LongBits@_@BlockSize@_@ItemsPerThread@_@WarpSmallLWS@_@WarpSmallIPT@_@WarpSmallBS@_@WarpPartition@_@WarpMediumLWS@_@WarpMediumIPT@_@WarpMediumBS@" PARENT_SCOPE)
   elseif(file STREQUAL "benchmark_device_transform")
     set(list_across_names "\
 DataType;BlockSize;" PARENT_SCOPE)
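
Each entry in `list_across_names` lines up with one entry in `list_across`, and `output_pattern_suffix` references the same names as `@Name@` tokens. The sketch below emulates that substitution in C++ to show how one parameter combination becomes one output suffix. It assumes a plain token-for-value replacement; the actual CMake helper is not shown in this diff, and `substitute` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Replaces every "@Name@" token in `pattern` with the value paired with
// that name. `names` and `values` must have the same length.
inline std::string substitute(std::string                     pattern,
                              const std::vector<std::string>& names,
                              const std::vector<std::string>& values)
{
    assert(names.size() == values.size());
    for(std::size_t i = 0; i < names.size(); ++i)
    {
        const std::string token = "@" + names[i] + "@";
        std::size_t       pos   = 0;
        while((pos = pattern.find(token, pos)) != std::string::npos)
        {
            pattern.replace(pos, token.size(), values[i]);
            pos += values[i].size(); // continue after the inserted value
        }
    }
    return pattern;
}
```

For example, substituting `KeyType=int`, `LongBits=8`, `BlockSize=256` into `@KeyType@_@LongBits@_@BlockSize@` yields `int_8_256`. Since only `ItemsPerThread` lists more than one value (`4 8 16`) in the new settings, each key type would produce three such suffixes under this reading.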

diff --git a/benchmark/benchmark_device_segmented_radix_sort_keys.cpp b/benchmark/benchmark_device_segmented_radix_sort_keys.cpp
@@ -291,7 +291,7 @@ int main(int argc, char* argv[])
     config_autotune_register::register_benchmark_subset(benchmarks,
                                                         parallel_instance,
                                                         parallel_instances,
-                                                        size,
+                                                        min_size,
                                                         seed,
                                                         stream);
 #else

diff --git a/benchmark/benchmark_device_segmented_radix_sort_keys.parallel.cpp.in b/benchmark/benchmark_device_segmented_radix_sort_keys.parallel.cpp.in
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2022-2023 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -27,8 +27,20 @@
 
 namespace
 {
-auto benchmarks = config_autotune_register::create_bulk(device_segmented_radix_sort_benchmark_generator<@BlockSize@,
-                                                                                                        @ItemsPerThread@,
-                                                                                                        @KeyType@,
-                                                                                                        @PartitionAllowed@>::create);
+auto benchmarks = config_autotune_register::create_bulk(
+    device_segmented_radix_sort_benchmark_generator<
+        @LongBits@,
+        0,
+        @BlockSize@,
+        @ItemsPerThread@,
+        @WarpSmallLWS@,
+        @WarpSmallIPT@,
+        @WarpSmallBS@,
+        @WarpPartition@,
+        @WarpMediumLWS@,
+        @WarpMediumIPT@,
+        @WarpMediumBS@,
+        @KeyType@,
+        true
+    >::create);
 } // namespace
diff --git a/benchmark/benchmark_device_segmented_radix_sort_keys.parallel.hpp b/benchmark/benchmark_device_segmented_radix_sort_keys.parallel.hpp
@@ -261,92 +261,38 @@ struct device_segmented_radix_sort_benchmark : public config_autotune_interface
 
 template<typename Tp, template<Tp> class T, bool enable, Tp... Idx>
 struct decider;
-template<unsigned int BlockSize, unsigned int ItemsPerThread, typename Key, bool PartitionAllowed>
+
+template<unsigned int LongBits,
+         unsigned int ShortBits,
+         unsigned int BlockSize,
+         unsigned int ItemsPerThread,
+         unsigned int WarpSmallLWS,
+         unsigned int WarpSmallIPT,
+         unsigned int WarpSmallBS,
+         unsigned int WarpPartition,
+         unsigned int WarpMediumLWS,
+         unsigned int WarpMediumIPT,
+         unsigned int WarpMediumBS,
+         typename Key,
+         bool UnpartitionWarpAllowed = true>
 struct device_segmented_radix_sort_benchmark_generator
 {
-    template<unsigned int LongBits>
-    struct create_lrb
-    {
-        template<unsigned int ShortBits>
-        struct create_srb
-        {
-            template<bool EnableUnpartitionedWarpSort>
-            struct create_euws
-            {
-                template<unsigned int LogicalWarpSizeSmall>
-                struct create_lwss
-                {
-                    template<unsigned int PartitioningThreshold>
-                    struct create_pt
-                    {
-                        void operator()(
-                            std::vector<std::unique_ptr<config_autotune_interface>>& storage)
-                        {
-                            storage.emplace_back(
-                                std::make_unique<device_segmented_radix_sort_benchmark<
-                                    Key,
-                                    rocprim::segmented_radix_sort_config<
-                                        LongBits,
-                                        ShortBits,
-                                        rocprim::kernel_config<BlockSize, ItemsPerThread>,
-                                        rocprim::WarpSortConfig<LogicalWarpSizeSmall / 2,
-                                                                ItemsPerThread / 2,
-                                                                BlockSize,
-                                                                PartitioningThreshold,
-                                                                LogicalWarpSizeSmall,
-                                                                ItemsPerThread,
-                                                                BlockSize>,
-                                        EnableUnpartitionedWarpSort>>>());
-                        }
-                    };
-
-                    void
-                        operator()(std::vector<std::unique_ptr<config_autotune_interface>>& storage)
-                    {
-                        static_for_each<std::integer_sequence<unsigned int, 5>, create_pt>(storage);
-                    }
-                };
-
-                void operator()(std::vector<std::unique_ptr<config_autotune_interface>>& storage)
-                {
-                    if(PartitionAllowed)
-                    {
-
-                        static_for_each<std::integer_sequence<unsigned int, 8, 16, 32>,
-                                        create_lwss>(storage);
-                    }
-                    else
-                    {
-                        storage.emplace_back(
-                            std::make_unique<device_segmented_radix_sort_benchmark<
-                                Key,
-                                rocprim::segmented_radix_sort_config<
-                                    LongBits,
-                                    ShortBits,
-                                    rocprim::kernel_config<BlockSize, ItemsPerThread>,
-                                    rocprim::DisabledWarpSortConfig,
-                                    EnableUnpartitionedWarpSort>>>());
-                    }
-                }
-            };
-
-            void operator()(std::vector<std::unique_ptr<config_autotune_interface>>& storage)
-            {
-                decider<bool, create_euws, 1u << ShortBits <= BlockSize, true>::do_the_thing(
-                    storage);
-            }
-        };
-
-        void operator()(std::vector<std::unique_ptr<config_autotune_interface>>& storage)
-        {
-            decider<unsigned int, create_srb, 1u << LongBits <= BlockSize, 3, 5>::do_the_thing(
-                storage);
-        }
-    };
-
     static void create(std::vector<std::unique_ptr<config_autotune_interface>>& storage)
     {
-        static_for_each<std::integer_sequence<unsigned int, 4, 5>, create_lrb>(storage);
+        storage.emplace_back(std::make_unique<device_segmented_radix_sort_benchmark<
+                                 Key,
+                                 rocprim::segmented_radix_sort_config<
+                                     LongBits,
+                                     ShortBits,
+                                     rocprim::kernel_config<BlockSize, ItemsPerThread>,
+                                     rocprim::WarpSortConfig<WarpSmallLWS,
+                                                             WarpSmallIPT,
+                                                             WarpSmallBS,
+                                                             WarpPartition,
+                                                             WarpMediumLWS,
+                                                             WarpMediumIPT,
+                                                             WarpMediumBS>,
+                                     UnpartitionWarpAllowed>>>());
     }
 };
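
The rewritten generator above no longer enumerates parameter values through nested `static_for_each` helpers; each template instantiation registers exactly one benchmark, and the build system supplies the parameter combinations instead. The pattern can be sketched in isolation as follows. All types here (`benchmark_interface`, `kernel_config_stub`, `benchmark_stub`, `generator_stub`) are hypothetical stand-ins, not rocPRIM or benchmark types.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for config_autotune_interface.
struct benchmark_interface
{
    virtual ~benchmark_interface()      = default;
    virtual unsigned block_size() const = 0;
};

// Hypothetical stand-in for a kernel configuration type.
template<unsigned BlockSize, unsigned ItemsPerThread>
struct kernel_config_stub
{
    static constexpr unsigned block_size       = BlockSize;
    static constexpr unsigned items_per_thread = ItemsPerThread;
};

// Hypothetical stand-in for a benchmark that is parameterized on a config.
template<class Key, class Config>
struct benchmark_stub : benchmark_interface
{
    unsigned block_size() const override
    {
        return Config::block_size;
    }
};

// Flat generator: one instantiation appends exactly one benchmark.
template<unsigned BlockSize, unsigned ItemsPerThread, class Key>
struct generator_stub
{
    static void create(std::vector<std::unique_ptr<benchmark_interface>>& storage)
    {
        storage.emplace_back(
            std::make_unique<benchmark_stub<Key, kernel_config_stub<BlockSize, ItemsPerThread>>>());
    }
};
```

Calling `generator_stub<256, 8, int>::create(storage)` on an empty vector leaves it with a single entry whose `block_size()` is 256. Keeping one configuration per translation unit presumably lets an external search tool choose which combinations get compiled.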
 

diff --git a/benchmark/benchmark_device_segmented_radix_sort_pairs.parallel.cpp.in b/benchmark/benchmark_device_segmented_radix_sort_pairs.parallel.cpp.in
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2022-2023 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -27,9 +27,21 @@
 
 namespace
 {
-auto benchmarks = config_autotune_register::create_bulk(device_segmented_radix_sort_benchmark_generator<@BlockSize@,
-                                                                                                        @ItemsPerThread@,
-                                                                                                        @KeyType@,
-                                                                                                        @ValueType@,
-                                                                                                        @PartitionAllowed@>::create);
+auto benchmarks = config_autotune_register::create_bulk(
+    device_segmented_radix_sort_benchmark_generator<
+        @LongBits@,
+        8,
+        @BlockSize@,
+        @ItemsPerThread@,
+        @WarpSmallLWS@,
+        @WarpSmallIPT@,
+        @WarpSmallBS@,
+        @WarpPartition@,
+        @WarpMediumLWS@,
+        @WarpMediumIPT@,
+        @WarpMediumBS@,
+        @KeyType@,
+        @ValueType@,
+        true
+    >::create);
 } // namespace