StreamHPC 2023-10-18 (#480)

* Excessive shared memory usage in block_shuffle fix * remove block_sort_algorithm template param from block_sort_kernel_impl and block_sort_impl * fixed compile errors * Updated ChangeLog.md * remove unnecessary code * fixed CHANGELOG.md to not be so verbose about non public api changes * Add dynamic dispatch and autotuning to device_adjacent_difference * Fix device_adjacent_difference storage type * ci: remove autotune dependency from build:benchmark The workaround needed to make this work is has major disadvantages, and our current workflow does not make use of this dependency anyway (Currently the generated configs are checked into the repository, so the CI would run the benchmarks on them on the next push to the merge-request). When we improve automation around autotuning this could be implemented with conditional jobs, but lets just drop the dependency for now. * test: fix indexing error test_type_helper<custom_16aligned>::get_random_data Indexing was 4 based when the type has 3 variables, therefore it was overflowing. Caught with address sanitizer. * fixes for compilation in debug for radix_sort - Add force inline to onesweep kernel, to avoid too much shared memory errors - Declare `block_radix_sort::radix_bits_per_pass` to fix linker errors * fix: Detect DPP & DPP broadcast support with __GFX<GENERATION>__ macros The amdgpu target in clang now provides the GFX generation as a predefined macro, so we no longer need to explicitly list all targets, which was bad for maintenance. Also replace the use of the generic `ROCPRIM_NAVI` which signals navi support, with `ROCPRIM_DETAIL_HAS_DPP_BROADCAST`, a macro that explicitly states what we're after. Also also makes sure that `ROCPRIM_DETAIL_USE_DPP` is always defined (to 0 when DPP is disabled), previously it was undefined when `ROCPRIM_DISABLE_DPP` was set. * refactor: Use __GFX<GENERATION>__ to detect NAVI cards * docs: Update CHANGELOG for DPP & ROCPRIM_NAVI fixes * remove deprecated structs and functions * rename scan_by_key_config_v2 to scan_by_key_config remove the option to use custom implemented config for scan_by_key update tests to not use custom implemented config for scan_by_key * remove the option to use custom implemented config for histogram update tests to not use custom implemented config for histogram * update config compile time check to a different pattern * update documentation comments for configs * change documentation comments * change documentation comments on device_radix_sort rename radix_sort_config_v2 to radix_sort_config * change documentation comment add static_assert to check type for reduce_config * update documentation comments remove wrap_scan_config function add static_assert to disallow custom scan_config type rename scan_config_v2 to scan_config * update documentation comments * update documentation comments make transform_config inherit from detail::transfomr_config_params remove wrap_transform_config add static assert to test for Config type in device_transform * remove wrap_adjacent_difference_config function add static_assert to test config type create default ctor for adjacent_difference_config * add missing transform_config ctor rewrite adjacent_difference_config ctor to match other config structs * fix binary search still using wrap_transform_config * implement static_asset to make binary_search only use binary search configs, but also work with the underlying transform * update changelog * remove some *_v2s that went under the radar * remove unnecessary default values * Add binary search, lower_bound and upper_bound documentation * host_warp_size() is replaced with two different versions with parameters. the new versions use either a device id or a stream to figure out the warp size of the device * comment out unused param names * fix typos in the documentation * move host_warp_size to config_type.hpp changed host_warp_size signatures to fit other similar functions * add error checks to host_warp_size calls in tests and benchmarks * fix format * add missing comment * fix error handling in lookback_scan_state.hpp * fix compilation error * change block_radix_rank_match and block_histogram_atomic to use rocprim::match_any instead of implementing same functionality * change radix_digit_count_helper to use rocprim::match_any instead of implementing same functionality added predicate param to rocprim::match_any to set invalid lanes and added tests for this functionality * add elect function to warp intrinsics add test for elect change block_histogram_atomic, block_radix_rank_match, device_histogram, device_radix_sort to use elect instead of copy-paste code * update match_any to return 0 when predicate is false * fix the bit check in elect function * update changelog.md * fix hard coded warps per block value to come from param in kernel * remove unused variables * fix review comments minor name changes update test update comments * update group_elect test tests multiple groups per warp doesn't check which exact thread is elected in a group, only that one is elected * remove unnecessary comments * remove expected from group_elect test fix compile error * fix overindexing * fix review comments update group_elect_test to have better coverage * format * fix review comments * fix perf regression * undo group_elect in block_histogram_atomic.hpp, because of perf impact * fix bad func name in CHANGELOG.md * fix merge errors * Fix reduce_by_key algorithm so keys[0] is not flagged as a new run when is nan * make device_radix_sort compatible with compiler provided __int128_t and __uint128_t * add ifdefs to only compile int128 parts on clang/gcc * update changelog * fix for int128 to_string labdas * add test for block_radix_sort int128 support * Implement block run length decode * Fix reduce_by_key algorithm so out of bounds items are not flagged as new runs for NaNs * Add reduce_by_key test to check that flagging is correct when keys are all different * Fix performance regression observed during tuning for gfx1030 and gfx1102 * Block Runlength Decode: Fix incorrect offsets and improve test * Remove duplicate key from .clang-format * Remove additional duplicates from clang-format * Fix binary_search upper/lower_bound config tuning Use specialized configurations for upper, lower, and binary search algorithms when preforming tuning * unify language around config params in documentation * Make the autotune build job run nightly * remove radix_sort_onesweep autotuning workaround * Resolve doxygen warnings for upstream PR * Enable get_device_from_stream for Windows * Use _ENABLE_EXTENDED_ALIGNED_STORAGE for windows build in rmake.py * Bump unreleased ROCm version --------- Co-authored-by: Ivan Siutsou <[email protected]> Co-authored-by: Bence Parajdi <[email protected]> Co-authored-by: Bálint Soproni <[email protected]> Co-authored-by: Gergely Meszaros <[email protected]> Co-authored-by: Beatriz Navidad Vilches <[email protected]> Co-authored-by: Mátyás Aradi <[email protected]>
ROCm · Nov 14, 2023 · f2347ab · f2347ab
1 parent 56daf45
commit f2347ab
Show file tree

Hide file tree

Showing 100 changed files with 5,376 additions and 2,565 deletions.
diff --git a/.clang-format b/.clang-format
@@ -58,8 +58,6 @@ BraceWrapping:
     AfterNamespace: true
     AfterStruct: true
     AfterUnion: true
-    BeforeCatch: true
-    BeforeElse: true
     AfterExternBlock: false
     BeforeCatch: true
     BeforeElse: true

diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,14 @@
 ### Build dirs ###
-build/
+build*/
+
+### clangd. ### 
+/.cache
+
+### Docs dirs ###
+doc/html/
+doc/xml/
+doc/latex/
+doc/*.tag
 
 # Created by https://www.gitignore.io/api/c++,cmake
 

diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
@@ -230,9 +230,7 @@ build:package:
 
 build:benchmark:
   stage: build
-  needs:
-    - job: "autotune:generate-config"
-      optional: true
+  needs: []
   tags:
     - rocm-build
   extends:
@@ -270,7 +268,7 @@ autotune:build:
   extends:
     - .cmake-minimum
     - .gpus:rocm-gpus
-    - .rules:manual
+    - .rules:benchmark
   variables:
     BENCHMARK_TARGETS: benchmark_config_tuning
   script:
@@ -282,6 +280,7 @@ autotune:build:
       -S $CI_PROJECT_DIR
       -G Ninja
       -D CMAKE_CXX_COMPILER="$AMDCLANG"
+      -D CMAKE_CXX_FLAGS="-Wno-#pragma-messages"
       -D CMAKE_BUILD_TYPE=Release
       -D BUILD_TEST=OFF
       -D BUILD_EXAMPLE=OFF
@@ -472,11 +471,7 @@ autotune:execute-tuning:
     # Exclude benchmark that is known to fail on gfx906
     # On ROCm 5.7 or later, check if this can be removed - the presumption is that the failure is caused by a compiler issue.
     - >
-      if [[ "${GPU_TARGET}" == "gfx906" ]] && [[ "${AUTOTUNE_ALGORITHM_REGEX}" == "" ]]; then
-        export AUTOTUNE_ALGORITHM_REGEX="-\{\"lvl\":\"device\",\"algo\":\"radix_sort_onesweep\",\"key_type\":\"short\",\"value_type\":\"short\",\"cfg\":\{\"histogram\":\{\"bs\":1024,\"ipt\":22},\"sort\":\{\"bs\":1024,\"ipt\":22},\"bits_per_place\":5,\"algorithm\":\"block_radix_rank_algorithm::match\"}}"
-      fi
-    - 'printf "CI Variables used in benchmarks:\nAUTOTUNE_RESULT_DIR: %s\nAUTOTUNE_FILENAME_REGEX: %s\nAUTOTUNE_ALGORITHM_REGEX: %s \nAUTOTUNE_SIZE: %s \nAUTOTUNE_TRIALS: %s\n" "$AUTOTUNE_RESULT_DIR" "$AUTOTUNE_FILENAME_REGEX" "$AUTOTUNE_ALGORITHM_REGEX" "$AUTOTUNE_SIZE" "$AUTOTUNE_TRIALS"'
-    - cd "${CI_PROJECT_DIR}"
+      cd "${CI_PROJECT_DIR}"
     - mkdir -p "${AUTOTUNE_RESULT_DIR}"
     - python3
       .gitlab/run_benchmarks.py

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,19 +2,39 @@
 
 Full documentation for rocPRIM is available at [https://rocprim.readthedocs.io/en/latest/](https://rocprim.readthedocs.io/en/latest/)
 
+## [Unreleased rocPRIM-3.0.0 for ROCm 6.1.0]
+### Added
+ - Added new primitive: `block_run_length_decode`.
+### Changed
+- Removed deprecated functionality: `reduce_by_key_config`, `MatchAny`, `scan_config`, `scan_by_key_config` and `radix_sort_config`.
+- Renamed `scan_config_v2` to `scan_config`, `scan_by_key_config_v2` to `scan_by_key_config`, `radix_sort_config_v2` to `radix_sort_config`, `reduce_by_key_config_v2` to `reduce_by_key_config`, `radix_sort_config_v2` to `radix_sort_config`.
+- Removed support for custom config types for device algorithms.
+- `host_warp_size()` was moved into `rocprim/device/config_types.hpp`, and now uses either a `device_id` or a `stream` parameter to query the proper device and a `device_id` out parameter. The return type is `hipError_t`.
+- Added support for __int128_t in `device_radix_sort` and `block_radix_sort`.
+### Fixed
+- Fixed build issues with `rmake.py` on Windows when using VS 2017 15.8 or later due to a breaking fix with extended aligned storage.
+
 ## [Unreleased rocPRIM-2.13.1 for ROCm 5.7.0]
 ### Added
 - `block_sort::sort()` overload for keys and values with a dynamic size, for all block sort algorithms. Additionally, all `block_sort::sort()` overloads with a dynamic size are now supported for `block_sort_algorithm::merge_sort` and `block_sort_algorithm::bitonic_sort`.
 - New two-way partition primitive `partition_two_way` which can write to two separate iterators.
+- Added config tuning and dynamic dispatch to `device_adjacent_difference` algorithm
+- New `rocprim::group_elect` warp intrinsic, which chooses one lane from the lanes enabled by a mask.
 ### Changed
 - Deprecated configuration `radix_sort_config` for device-level radix sort as it no longer matches the algorithm's parameters. New configuration `radix_sort_config_v2` is preferred instead.
 - Removed erroneous implementation of device-level `inclusive_scan` and `exclusive_scan`. The prior default implementation using lookback-scan now is the only available implementation.
 - The benchmark metric indicating the bytes processed for `exclusive_scan_by_key` and `inclusive_scan_by_key` has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as `scan` and `scan_by_key` instead of `scan_exclusive` and `scan_inclusive`.
 - Deprecated configurations `scan_config` and `scan_by_key_config` for device-level scans, as they no longer match the algorithm's parameters. New configurations `scan_config_v2` and `scan_by_key_config_v2` are preferred instead.
 - Improved the performance of `partition`.
+- `merge_sort_block_sort` will always use stable merge sort as it is faster than the fallback implementation.
+- The `rocprim::match_any` interface has a new parameter, `valid` to enalble/disable lanes. The default value is true, so it doesn't change the previous behaviour. 
 ### Fixed
 - Fixed build issue caused by missing header in `thread/thread_search.hpp`.
 - Fixed `rocprim::MatchAny` for devices with 64-bit warp size. The function `rocprim::MatchAny` is deprecated and `rocprim::match_any` is preferred instead.
+- Fixed `device_adjacent_difference` using more shared memory than required.
+- Fixed a compilation error when `ROCPRIM_DISABLE_DPP` is defined.
+- rocPRIM should be more robust for detecting GPU architecture features. Explicitly listing each architecture is no longer required by developers, fixing compilation failures when
+  targeting devices not known by rocPRIM.
 
 ## [rocPRIM-2.13.0 for ROCm 5.5.0]
 ### Added

diff --git a/benchmark/CMakeLists.txt b/benchmark/CMakeLists.txt
@@ -123,6 +123,7 @@ add_rocprim_benchmark(benchmark_block_histogram.cpp)
 add_rocprim_benchmark(benchmark_block_radix_sort.cpp)
 add_rocprim_benchmark(benchmark_block_radix_rank.cpp)
 add_rocprim_benchmark(benchmark_block_reduce.cpp)
+add_rocprim_benchmark(benchmark_block_run_length_decode.cpp)
 add_rocprim_benchmark(benchmark_block_scan.cpp)
 add_rocprim_benchmark(benchmark_block_sort.cpp)
 add_rocprim_benchmark(benchmark_config_dispatch.cpp)

diff --git a/benchmark/ConfigAutotuneSettings.cmake b/benchmark/ConfigAutotuneSettings.cmake
@@ -29,10 +29,10 @@ set(LIMITED_TUNING_TYPES "int64_t int short int8_t")
 
 function(read_config_autotune_settings file list_across_names list_across output_pattern_suffix)
   if(file STREQUAL "benchmark_device_adjacent_difference")
-    set(list_across_names "DataType;Left;InPlace;BlockSize;ItemsPerThread" PARENT_SCOPE)
+    set(list_across_names "DataType;Left;InPlace;BlockSize" PARENT_SCOPE)
     set(list_across "${TUNING_TYPES};\
-true false;true false;64 128;1 2 4 8 16" PARENT_SCOPE)
-    set(output_pattern_suffix "@DataType@_@Left@_@InPlace@_@BlockSize@_@ItemsPerThread@" PARENT_SCOPE)
+true;false true;32 64 128 256 512 1024" PARENT_SCOPE)
+    set(output_pattern_suffix "@DataType@_@Left@_@InPlace@_@BlockSize@" PARENT_SCOPE)
   elseif(file STREQUAL "benchmark_device_histogram")
     set(list_across_names "DataType;BlockSize" PARENT_SCOPE)
     set(list_across "${TUNING_TYPES};64 128 256" PARENT_SCOPE)

diff --git a/benchmark/benchmark_block_run_length_decode.cpp b/benchmark/benchmark_block_run_length_decode.cpp
@@ -0,0 +1,242 @@
+// MIT License
+//
+// Copyright (c) 2021-2023 Advanced Micro Devices, Inc. All rights reserved.
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+
+#include "benchmark/benchmark.h"
+#include "benchmark_utils.hpp"
+#include "cmdparser.hpp"
+
+#include "rocprim/block/block_load.hpp"
+#include "rocprim/block/block_run_length_decode.hpp"
+#include "rocprim/block/block_store.hpp"
+
+#include <random>
+#include <vector>
+
+#ifndef DEFAULT_N
+const size_t DEFAULT_N = 1024 * 1024 * 32;
+#endif
+
+template<class ItemT,
+         class OffsetT,
+         unsigned BlockSize,
+         unsigned RunsPerThread,
+         unsigned DecodedItemsPerThread,
+         unsigned Trials>
+__global__
+    __launch_bounds__(BlockSize) void block_run_length_decode_kernel(const ItemT*   d_run_items,
+                                                                     const OffsetT* d_run_offsets,
+                                                                     ItemT*         d_decoded_items,
+                                                                     bool enable_store = false)
+{
+    using BlockRunLengthDecodeT
+        = rocprim::block_run_length_decode<ItemT, BlockSize, RunsPerThread, DecodedItemsPerThread>;
+
+    ItemT   run_items[RunsPerThread];
+    OffsetT run_offsets[RunsPerThread];
+
+    const unsigned global_thread_idx = BlockSize * hipBlockIdx_x + hipThreadIdx_x;
+    rocprim::block_load_direct_blocked(global_thread_idx, d_run_items, run_items);
+    rocprim::block_load_direct_blocked(global_thread_idx, d_run_offsets, run_offsets);
+
+    ROCPRIM_SHARED_MEMORY typename BlockRunLengthDecodeT::storage_type temp_storage;
+    BlockRunLengthDecodeT block_run_length_decode(run_items, run_offsets);
+
+    const OffsetT total_decoded_size
+        = d_run_offsets[(hipBlockIdx_x + 1) * BlockSize * RunsPerThread]
+          - d_run_offsets[hipBlockIdx_x * BlockSize * RunsPerThread];
+
+#pragma nounroll
+    for(unsigned i = 0; i < Trials; ++i)
+    {
+        OffsetT decoded_window_offset = 0;
+        while(decoded_window_offset < total_decoded_size)
+        {
+            ItemT decoded_items[DecodedItemsPerThread];
+            block_run_length_decode.run_length_decode(decoded_items, decoded_window_offset);
+
+            if(enable_store)
+            {
+                rocprim::block_store_direct_blocked(global_thread_idx,
+                                                    d_decoded_items + decoded_window_offset,
+                                                    decoded_items);
+            }
+
+            decoded_window_offset += BlockSize * DecodedItemsPerThread;
+        }
+    }
+}
+
+template<class ItemT,
+         class OffsetT,
+         unsigned MinRunLength,
+         unsigned MaxRunLength,
+         unsigned BlockSize,
+         unsigned RunsPerThread,
+         unsigned DecodedItemsPerThread,
+         unsigned Trials = 100>
+void run_benchmark(benchmark::State& state, hipStream_t stream, size_t N)
+{
+    constexpr auto runs_per_block  = BlockSize * RunsPerThread;
+    const auto     target_num_runs = 2 * N / (MinRunLength + MaxRunLength);
+    const auto     num_runs
+        = runs_per_block * ((target_num_runs + runs_per_block - 1) / runs_per_block);
+
+    std::vector<ItemT>   run_items(num_runs);
+    std::vector<OffsetT> run_offsets(num_runs + 1);
+
+    std::default_random_engine prng(std::random_device{}());
+    using ItemDistribution = std::conditional_t<std::is_integral<ItemT>::value,
+                                                std::uniform_int_distribution<ItemT>,
+                                                std::uniform_real_distribution<ItemT>>;
+    ItemDistribution                       run_item_dist(0, 100);
+    std::uniform_int_distribution<OffsetT> run_length_dist(MinRunLength, MaxRunLength);
+
+    for(size_t i = 0; i < num_runs; ++i)
+    {
+        run_items[i] = run_item_dist(prng);
+    }
+    for(size_t i = 1; i < num_runs + 1; ++i)
+    {
+        const OffsetT next_run_length = run_length_dist(prng);
+        run_offsets[i]                = run_offsets[i - 1] + next_run_length;
+    }
+    const OffsetT output_length = run_offsets.back();
+
+    ItemT* d_run_items{};
+    HIP_CHECK(hipMalloc(&d_run_items, run_items.size() * sizeof(ItemT)));
+    HIP_CHECK(hipMemcpy(d_run_items,
+                        run_items.data(),
+                        run_items.size() * sizeof(ItemT),
+                        hipMemcpyHostToDevice));
+
+    OffsetT* d_run_offsets{};
+    HIP_CHECK(hipMalloc(&d_run_offsets, run_offsets.size() * sizeof(OffsetT)));
+    HIP_CHECK(hipMemcpy(d_run_offsets,
+                        run_offsets.data(),
+                        run_offsets.size() * sizeof(OffsetT),
+                        hipMemcpyHostToDevice));
+
+    ItemT* d_output{};
+    HIP_CHECK(hipMalloc(&d_output, output_length * sizeof(ItemT)));
+
+    for(auto _ : state)
+    {
+        auto start = std::chrono::high_resolution_clock::now();
+        hipLaunchKernelGGL(HIP_KERNEL_NAME(block_run_length_decode_kernel<ItemT,
+                                                                          OffsetT,
+                                                                          BlockSize,
+                                                                          RunsPerThread,
+                                                                          DecodedItemsPerThread,
+                                                                          Trials>),
+                           dim3(num_runs / runs_per_block),
+                           dim3(BlockSize),
+                           0,
+                           stream,
+                           d_run_items,
+                           d_run_offsets,
+                           d_output);
+        HIP_CHECK(hipPeekAtLastError());
+        HIP_CHECK(hipDeviceSynchronize());
+
+        auto end = std::chrono::high_resolution_clock::now();
+        auto elapsed_seconds
+            = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
+
+        state.SetIterationTime(elapsed_seconds.count());
+    }
+    state.SetBytesProcessed(state.iterations() * output_length * sizeof(ItemT) * Trials);
+    state.SetItemsProcessed(state.iterations() * output_length * Trials);
+
+    HIP_CHECK(hipFree(d_run_items));
+    HIP_CHECK(hipFree(d_run_offsets));
+    HIP_CHECK(hipFree(d_output));
+}
+
+#define CREATE_BENCHMARK(IT, OT, MINRL, MAXRL, BS, RPT, DIPT)                                 \
+    benchmark::RegisterBenchmark("block_run_length_decode<Item Type:" #IT ",Offset Type:" #OT \
+                                 ",Min RunLength:" #MINRL ",Max RunLength:" #MAXRL            \
+                                 ",BlockSize: " #BS ",Runs Per Thread:" #RPT                  \
+                                 ",Decoded Items Per Thread:" #DIPT ">",                      \
+                                 &run_benchmark<IT, OT, MINRL, MAXRL, BS, RPT, DIPT>,         \
+                                 stream,                                                      \
+                                 size)
+
+int main(int argc, char* argv[])
+{
+    cli::Parser parser(argc, argv);
+    parser.set_optional<size_t>("size", "size", DEFAULT_N, "number of values");
+    parser.set_optional<int>("trials", "trials", -1, "number of iterations");
+    parser.run_and_exit_if_error();
+
+    // Parse argv
+    benchmark::Initialize(&argc, argv);
+    const size_t size   = parser.get<size_t>("size");
+    const int    trials = parser.get<int>("trials");
+
+    std::cout << "benchmark_block_run_length_decode" << std::endl;
+
+    // HIP
+    hipStream_t     stream = 0; // default
+    hipDeviceProp_t devProp;
+    int             device_id = 0;
+    HIP_CHECK(hipGetDevice(&device_id));
+    HIP_CHECK(hipGetDeviceProperties(&devProp, device_id));
+    std::cout << "[HIP] Device name: " << devProp.name << std::endl;
+
+    // Add benchmarks
+    std::vector<benchmark::internal::Benchmark*> benchmarks{
+        CREATE_BENCHMARK(int, int, 1, 5, 128, 2, 4),
+        CREATE_BENCHMARK(int, int, 1, 10, 128, 2, 4),
+        CREATE_BENCHMARK(int, int, 1, 50, 128, 2, 4),
+        CREATE_BENCHMARK(int, int, 1, 100, 128, 2, 4),
+        CREATE_BENCHMARK(int, int, 1, 500, 128, 2, 4),
+        CREATE_BENCHMARK(int, int, 1, 1000, 128, 2, 4),
+        CREATE_BENCHMARK(int, int, 1, 5000, 128, 2, 4),
+
+        CREATE_BENCHMARK(double, long long, 1, 5, 128, 2, 4),
+        CREATE_BENCHMARK(double, long long, 1, 10, 128, 2, 4),
+        CREATE_BENCHMARK(double, long long, 1, 50, 128, 2, 4),
+        CREATE_BENCHMARK(double, long long, 1, 100, 128, 2, 4),
+        CREATE_BENCHMARK(double, long long, 1, 500, 128, 2, 4),
+        CREATE_BENCHMARK(double, long long, 1, 1000, 128, 2, 4),
+        CREATE_BENCHMARK(double, long long, 1, 5000, 128, 2, 4)};
+
+    // Use manual timing
+    for(auto& b : benchmarks)
+    {
+        b->UseManualTime();
+        b->Unit(benchmark::kMillisecond);
+    }
+
+    // Force number of iterations
+    if(trials > 0)
+    {
+        for(auto& b : benchmarks)
+        {
+            b->Iterations(trials);
+        }
+    }
+
+    // Run benchmarks
+    benchmark::RunSpecifiedBenchmarks();
+    return 0;
+}
diff --git a/benchmark/benchmark_device_adjacent_difference.parallel.cpp.in b/benchmark/benchmark_device_adjacent_difference.parallel.cpp.in
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2022 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2022-2023 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -26,9 +26,11 @@
 #include "benchmark_device_adjacent_difference.parallel.hpp"
 
 namespace {
-    auto benchmarks = config_autotune_register::create<device_adjacent_difference_benchmark<
+    auto benchmarks = config_autotune_register::create_bulk(
+        device_adjacent_difference_benchmark_generator< 
         @DataType@, 
+        @BlockSize@,
         @Left@, 
-        @InPlace@,
-        rocprim::adjacent_difference_config<@BlockSize@, @ItemsPerThread@>>>();
+        @InPlace@>::create);
+
 }