Skip to content

Commit

Permalink
StreamHPC 2023-10-18 (#480)
Browse files Browse the repository at this point in the history
* Excessive shared memory usage in block_shuffle fix

* remove block_sort_algorithm template param from block_sort_kernel_impl and block_sort_impl

* fixed compile errors

* Updated ChangeLog.md

* remove unnecessary code

* fixed CHANGELOG.md to not be so verbose about non public api changes

* Add dynamic dispatch and autotuning to device_adjacent_difference

* Fix device_adjacent_difference storage type

* ci: remove autotune dependency from build:benchmark

The workaround needed to make this work is has major disadvantages,
and our current workflow does not make use of this dependency anyway
(Currently the generated configs are checked into the repository, so
the CI would run the benchmarks on them on the next push to the
merge-request).

When we improve automation around autotuning this could be implemented
with conditional jobs, but lets just drop the dependency for now.

* test: fix indexing error test_type_helper<custom_16aligned>::get_random_data

Indexing was 4 based when the type has 3 variables, therefore it was
overflowing. Caught with address sanitizer.

* fixes for compilation in debug for radix_sort

- Add force inline to onesweep kernel, to avoid too much shared memory
  errors
- Declare `block_radix_sort::radix_bits_per_pass` to fix linker errors

* fix: Detect DPP & DPP broadcast support with __GFX<GENERATION>__ macros

The amdgpu target in clang now provides the GFX generation as a
predefined macro, so we no longer need to explicitly list all targets,
which was bad for maintenance.

Also replace the use of the generic `ROCPRIM_NAVI` which signals navi support,
with `ROCPRIM_DETAIL_HAS_DPP_BROADCAST`, a macro that explicitly
states what we're after.

Also also makes sure that `ROCPRIM_DETAIL_USE_DPP` is always defined
(to 0 when DPP is disabled), previously it was undefined when
`ROCPRIM_DISABLE_DPP` was set.

* refactor: Use __GFX<GENERATION>__ to detect NAVI cards

* docs: Update CHANGELOG for DPP & ROCPRIM_NAVI fixes

* remove deprecated structs and functions

* rename scan_by_key_config_v2 to scan_by_key_config
remove the option to use custom implemented config for scan_by_key
update tests to not use custom implemented config for scan_by_key

* remove the option to use custom implemented config for histogram
update tests to not use custom implemented config for histogram

* update config compile time check to a different pattern

* update documentation comments for configs

* change documentation comments

* change documentation comments on device_radix_sort
rename radix_sort_config_v2 to radix_sort_config

* change documentation comment
add static_assert to check type for reduce_config

* update documentation comments
remove wrap_scan_config function
add static_assert to disallow custom scan_config type
rename scan_config_v2 to scan_config

* update documentation comments

* update documentation comments
make transform_config inherit from detail::transfomr_config_params
remove wrap_transform_config
add static assert to test for Config type in device_transform

* remove wrap_adjacent_difference_config function
add static_assert to test config type
create default ctor for adjacent_difference_config

* add missing transform_config ctor
rewrite adjacent_difference_config ctor to match other config structs

* fix binary search still using wrap_transform_config

* implement static_asset to make binary_search only use binary search configs, but also work with the underlying transform

* update changelog

* remove some *_v2s that went under the radar

* remove unnecessary default values

* Add binary search, lower_bound and upper_bound documentation

* host_warp_size() is replaced with two different versions with parameters.
the new versions use either a device id or a stream to figure out the warp size of the device

* comment out unused param names

* fix typos in the documentation

* move host_warp_size to config_type.hpp
changed host_warp_size signatures to fit other similar functions

* add error checks to host_warp_size calls in tests and benchmarks

* fix format

* add missing comment

* fix error handling in lookback_scan_state.hpp

* fix compilation error

* change block_radix_rank_match and block_histogram_atomic to use rocprim::match_any instead of implementing same functionality

* change radix_digit_count_helper to use rocprim::match_any instead of implementing same functionality
added predicate param to rocprim::match_any to set invalid lanes and added tests for this functionality

* add elect function to warp intrinsics
add test for elect
change block_histogram_atomic, block_radix_rank_match, device_histogram, device_radix_sort to use elect instead of copy-paste code

* update match_any to return 0 when predicate is false

* fix the bit check in elect function

* update changelog.md

* fix hard coded warps per block value to come from param in kernel

* remove unused variables

* fix review comments
minor name changes
update test
update comments

* update group_elect test
tests multiple groups per warp
doesn't check which exact thread is elected in a group, only that one is elected

* remove unnecessary comments

* remove expected from group_elect test
fix compile error

* fix overindexing

* fix review comments
update group_elect_test to have better coverage

* format

* fix review comments

* fix perf regression

* undo group_elect in block_histogram_atomic.hpp, because of perf impact

* fix bad func name in CHANGELOG.md

* fix merge errors

* Fix reduce_by_key algorithm so keys[0] is not flagged as a new run when is nan

* make device_radix_sort compatible with compiler provided __int128_t and __uint128_t

* add ifdefs to only compile int128 parts on clang/gcc

* update changelog

* fix for int128 to_string labdas

* add test for block_radix_sort int128 support

* Implement block run length decode

* Fix reduce_by_key algorithm so out of bounds items are not flagged as new runs for NaNs

* Add reduce_by_key test to check that flagging is correct when keys are all different

* Fix performance regression observed during tuning for gfx1030 and gfx1102

* Block Runlength Decode: Fix incorrect offsets and improve test

* Remove duplicate key from .clang-format

* Remove additional duplicates from clang-format

* Fix binary_search upper/lower_bound config tuning

Use specialized configurations for upper, lower, and binary search
algorithms when preforming tuning

* unify language around config params in documentation

* Make the autotune build job run nightly

* remove radix_sort_onesweep autotuning workaround

* Resolve doxygen warnings for upstream PR

* Enable get_device_from_stream for Windows

* Use _ENABLE_EXTENDED_ALIGNED_STORAGE for windows build in rmake.py

* Bump unreleased ROCm version

---------

Co-authored-by: Ivan Siutsou <[email protected]>
Co-authored-by: Bence Parajdi <[email protected]>
Co-authored-by: Bálint Soproni <[email protected]>
Co-authored-by: Gergely Meszaros <[email protected]>
Co-authored-by: Beatriz Navidad Vilches <[email protected]>
Co-authored-by: Mátyás Aradi <[email protected]>
  • Loading branch information
7 people authored Nov 14, 2023
1 parent 56daf45 commit f2347ab
Show file tree
Hide file tree
Showing 100 changed files with 5,376 additions and 2,565 deletions.
2 changes: 0 additions & 2 deletions .clang-format
Original file line number Diff line number Diff line change
Expand Up @@ -58,8 +58,6 @@ BraceWrapping:
AfterNamespace: true
AfterStruct: true
AfterUnion: true
BeforeCatch: true
BeforeElse: true
AfterExternBlock: false
BeforeCatch: true
BeforeElse: true
Expand Down
11 changes: 10 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,5 +1,14 @@
### Build dirs ###
build/
build*/

### clangd. ###
/.cache

### Docs dirs ###
doc/html/
doc/xml/
doc/latex/
doc/*.tag

# Created by https://www.gitignore.io/api/c++,cmake

Expand Down
13 changes: 4 additions & 9 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -230,9 +230,7 @@ build:package:

build:benchmark:
stage: build
needs:
- job: "autotune:generate-config"
optional: true
needs: []
tags:
- rocm-build
extends:
Expand Down Expand Up @@ -270,7 +268,7 @@ autotune:build:
extends:
- .cmake-minimum
- .gpus:rocm-gpus
- .rules:manual
- .rules:benchmark
variables:
BENCHMARK_TARGETS: benchmark_config_tuning
script:
Expand All @@ -282,6 +280,7 @@ autotune:build:
-S $CI_PROJECT_DIR
-G Ninja
-D CMAKE_CXX_COMPILER="$AMDCLANG"
-D CMAKE_CXX_FLAGS="-Wno-#pragma-messages"
-D CMAKE_BUILD_TYPE=Release
-D BUILD_TEST=OFF
-D BUILD_EXAMPLE=OFF
Expand Down Expand Up @@ -472,11 +471,7 @@ autotune:execute-tuning:
# Exclude benchmark that is known to fail on gfx906
# On ROCm 5.7 or later, check if this can be removed - the presumption is that the failure is caused by a compiler issue.
- >
if [[ "${GPU_TARGET}" == "gfx906" ]] && [[ "${AUTOTUNE_ALGORITHM_REGEX}" == "" ]]; then
export AUTOTUNE_ALGORITHM_REGEX="-\{\"lvl\":\"device\",\"algo\":\"radix_sort_onesweep\",\"key_type\":\"short\",\"value_type\":\"short\",\"cfg\":\{\"histogram\":\{\"bs\":1024,\"ipt\":22},\"sort\":\{\"bs\":1024,\"ipt\":22},\"bits_per_place\":5,\"algorithm\":\"block_radix_rank_algorithm::match\"}}"
fi
- 'printf "CI Variables used in benchmarks:\nAUTOTUNE_RESULT_DIR: %s\nAUTOTUNE_FILENAME_REGEX: %s\nAUTOTUNE_ALGORITHM_REGEX: %s \nAUTOTUNE_SIZE: %s \nAUTOTUNE_TRIALS: %s\n" "$AUTOTUNE_RESULT_DIR" "$AUTOTUNE_FILENAME_REGEX" "$AUTOTUNE_ALGORITHM_REGEX" "$AUTOTUNE_SIZE" "$AUTOTUNE_TRIALS"'
- cd "${CI_PROJECT_DIR}"
cd "${CI_PROJECT_DIR}"
- mkdir -p "${AUTOTUNE_RESULT_DIR}"
- python3
.gitlab/run_benchmarks.py
Expand Down
20 changes: 20 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,19 +2,39 @@

Full documentation for rocPRIM is available at [https://rocprim.readthedocs.io/en/latest/](https://rocprim.readthedocs.io/en/latest/)

## [Unreleased rocPRIM-3.0.0 for ROCm 6.1.0]
### Added
- Added new primitive: `block_run_length_decode`.
### Changed
- Removed deprecated functionality: `reduce_by_key_config`, `MatchAny`, `scan_config`, `scan_by_key_config` and `radix_sort_config`.
- Renamed `scan_config_v2` to `scan_config`, `scan_by_key_config_v2` to `scan_by_key_config`, `radix_sort_config_v2` to `radix_sort_config`, `reduce_by_key_config_v2` to `reduce_by_key_config`, `radix_sort_config_v2` to `radix_sort_config`.
- Removed support for custom config types for device algorithms.
- `host_warp_size()` was moved into `rocprim/device/config_types.hpp`, and now uses either a `device_id` or a `stream` parameter to query the proper device and a `device_id` out parameter. The return type is `hipError_t`.
- Added support for __int128_t in `device_radix_sort` and `block_radix_sort`.
### Fixed
- Fixed build issues with `rmake.py` on Windows when using VS 2017 15.8 or later due to a breaking fix with extended aligned storage.

## [Unreleased rocPRIM-2.13.1 for ROCm 5.7.0]
### Added
- `block_sort::sort()` overload for keys and values with a dynamic size, for all block sort algorithms. Additionally, all `block_sort::sort()` overloads with a dynamic size are now supported for `block_sort_algorithm::merge_sort` and `block_sort_algorithm::bitonic_sort`.
- New two-way partition primitive `partition_two_way` which can write to two separate iterators.
- Added config tuning and dynamic dispatch to `device_adjacent_difference` algorithm
- New `rocprim::group_elect` warp intrinsic, which chooses one lane from the lanes enabled by a mask.
### Changed
- Deprecated configuration `radix_sort_config` for device-level radix sort as it no longer matches the algorithm's parameters. New configuration `radix_sort_config_v2` is preferred instead.
- Removed erroneous implementation of device-level `inclusive_scan` and `exclusive_scan`. The prior default implementation using lookback-scan now is the only available implementation.
- The benchmark metric indicating the bytes processed for `exclusive_scan_by_key` and `inclusive_scan_by_key` has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as `scan` and `scan_by_key` instead of `scan_exclusive` and `scan_inclusive`.
- Deprecated configurations `scan_config` and `scan_by_key_config` for device-level scans, as they no longer match the algorithm's parameters. New configurations `scan_config_v2` and `scan_by_key_config_v2` are preferred instead.
- Improved the performance of `partition`.
- `merge_sort_block_sort` will always use stable merge sort as it is faster than the fallback implementation.
- The `rocprim::match_any` interface has a new parameter, `valid` to enalble/disable lanes. The default value is true, so it doesn't change the previous behaviour.
### Fixed
- Fixed build issue caused by missing header in `thread/thread_search.hpp`.
- Fixed `rocprim::MatchAny` for devices with 64-bit warp size. The function `rocprim::MatchAny` is deprecated and `rocprim::match_any` is preferred instead.
- Fixed `device_adjacent_difference` using more shared memory than required.
- Fixed a compilation error when `ROCPRIM_DISABLE_DPP` is defined.
- rocPRIM should be more robust for detecting GPU architecture features. Explicitly listing each architecture is no longer required by developers, fixing compilation failures when
targeting devices not known by rocPRIM.

## [rocPRIM-2.13.0 for ROCm 5.5.0]
### Added
Expand Down
1 change: 1 addition & 0 deletions benchmark/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -123,6 +123,7 @@ add_rocprim_benchmark(benchmark_block_histogram.cpp)
add_rocprim_benchmark(benchmark_block_radix_sort.cpp)
add_rocprim_benchmark(benchmark_block_radix_rank.cpp)
add_rocprim_benchmark(benchmark_block_reduce.cpp)
add_rocprim_benchmark(benchmark_block_run_length_decode.cpp)
add_rocprim_benchmark(benchmark_block_scan.cpp)
add_rocprim_benchmark(benchmark_block_sort.cpp)
add_rocprim_benchmark(benchmark_config_dispatch.cpp)
Expand Down
6 changes: 3 additions & 3 deletions benchmark/ConfigAutotuneSettings.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -29,10 +29,10 @@ set(LIMITED_TUNING_TYPES "int64_t int short int8_t")

function(read_config_autotune_settings file list_across_names list_across output_pattern_suffix)
if(file STREQUAL "benchmark_device_adjacent_difference")
set(list_across_names "DataType;Left;InPlace;BlockSize;ItemsPerThread" PARENT_SCOPE)
set(list_across_names "DataType;Left;InPlace;BlockSize" PARENT_SCOPE)
set(list_across "${TUNING_TYPES};\
true false;true false;64 128;1 2 4 8 16" PARENT_SCOPE)
set(output_pattern_suffix "@DataType@_@Left@_@InPlace@_@BlockSize@_@ItemsPerThread@" PARENT_SCOPE)
true;false true;32 64 128 256 512 1024" PARENT_SCOPE)
set(output_pattern_suffix "@DataType@_@Left@_@InPlace@_@BlockSize@" PARENT_SCOPE)
elseif(file STREQUAL "benchmark_device_histogram")
set(list_across_names "DataType;BlockSize" PARENT_SCOPE)
set(list_across "${TUNING_TYPES};64 128 256" PARENT_SCOPE)
Expand Down
242 changes: 242 additions & 0 deletions benchmark/benchmark_block_run_length_decode.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,242 @@
// MIT License
//
// Copyright (c) 2021-2023 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.

#include "benchmark/benchmark.h"
#include "benchmark_utils.hpp"
#include "cmdparser.hpp"

#include "rocprim/block/block_load.hpp"
#include "rocprim/block/block_run_length_decode.hpp"
#include "rocprim/block/block_store.hpp"

#include <random>
#include <vector>

#ifndef DEFAULT_N
const size_t DEFAULT_N = 1024 * 1024 * 32;
#endif

template<class ItemT,
class OffsetT,
unsigned BlockSize,
unsigned RunsPerThread,
unsigned DecodedItemsPerThread,
unsigned Trials>
__global__
__launch_bounds__(BlockSize) void block_run_length_decode_kernel(const ItemT* d_run_items,
const OffsetT* d_run_offsets,
ItemT* d_decoded_items,
bool enable_store = false)
{
using BlockRunLengthDecodeT
= rocprim::block_run_length_decode<ItemT, BlockSize, RunsPerThread, DecodedItemsPerThread>;

ItemT run_items[RunsPerThread];
OffsetT run_offsets[RunsPerThread];

const unsigned global_thread_idx = BlockSize * hipBlockIdx_x + hipThreadIdx_x;
rocprim::block_load_direct_blocked(global_thread_idx, d_run_items, run_items);
rocprim::block_load_direct_blocked(global_thread_idx, d_run_offsets, run_offsets);

ROCPRIM_SHARED_MEMORY typename BlockRunLengthDecodeT::storage_type temp_storage;
BlockRunLengthDecodeT block_run_length_decode(run_items, run_offsets);

const OffsetT total_decoded_size
= d_run_offsets[(hipBlockIdx_x + 1) * BlockSize * RunsPerThread]
- d_run_offsets[hipBlockIdx_x * BlockSize * RunsPerThread];

#pragma nounroll
for(unsigned i = 0; i < Trials; ++i)
{
OffsetT decoded_window_offset = 0;
while(decoded_window_offset < total_decoded_size)
{
ItemT decoded_items[DecodedItemsPerThread];
block_run_length_decode.run_length_decode(decoded_items, decoded_window_offset);

if(enable_store)
{
rocprim::block_store_direct_blocked(global_thread_idx,
d_decoded_items + decoded_window_offset,
decoded_items);
}

decoded_window_offset += BlockSize * DecodedItemsPerThread;
}
}
}

template<class ItemT,
class OffsetT,
unsigned MinRunLength,
unsigned MaxRunLength,
unsigned BlockSize,
unsigned RunsPerThread,
unsigned DecodedItemsPerThread,
unsigned Trials = 100>
void run_benchmark(benchmark::State& state, hipStream_t stream, size_t N)
{
constexpr auto runs_per_block = BlockSize * RunsPerThread;
const auto target_num_runs = 2 * N / (MinRunLength + MaxRunLength);
const auto num_runs
= runs_per_block * ((target_num_runs + runs_per_block - 1) / runs_per_block);

std::vector<ItemT> run_items(num_runs);
std::vector<OffsetT> run_offsets(num_runs + 1);

std::default_random_engine prng(std::random_device{}());
using ItemDistribution = std::conditional_t<std::is_integral<ItemT>::value,
std::uniform_int_distribution<ItemT>,
std::uniform_real_distribution<ItemT>>;
ItemDistribution run_item_dist(0, 100);
std::uniform_int_distribution<OffsetT> run_length_dist(MinRunLength, MaxRunLength);

for(size_t i = 0; i < num_runs; ++i)
{
run_items[i] = run_item_dist(prng);
}
for(size_t i = 1; i < num_runs + 1; ++i)
{
const OffsetT next_run_length = run_length_dist(prng);
run_offsets[i] = run_offsets[i - 1] + next_run_length;
}
const OffsetT output_length = run_offsets.back();

ItemT* d_run_items{};
HIP_CHECK(hipMalloc(&d_run_items, run_items.size() * sizeof(ItemT)));
HIP_CHECK(hipMemcpy(d_run_items,
run_items.data(),
run_items.size() * sizeof(ItemT),
hipMemcpyHostToDevice));

OffsetT* d_run_offsets{};
HIP_CHECK(hipMalloc(&d_run_offsets, run_offsets.size() * sizeof(OffsetT)));
HIP_CHECK(hipMemcpy(d_run_offsets,
run_offsets.data(),
run_offsets.size() * sizeof(OffsetT),
hipMemcpyHostToDevice));

ItemT* d_output{};
HIP_CHECK(hipMalloc(&d_output, output_length * sizeof(ItemT)));

for(auto _ : state)
{
auto start = std::chrono::high_resolution_clock::now();
hipLaunchKernelGGL(HIP_KERNEL_NAME(block_run_length_decode_kernel<ItemT,
OffsetT,
BlockSize,
RunsPerThread,
DecodedItemsPerThread,
Trials>),
dim3(num_runs / runs_per_block),
dim3(BlockSize),
0,
stream,
d_run_items,
d_run_offsets,
d_output);
HIP_CHECK(hipPeekAtLastError());
HIP_CHECK(hipDeviceSynchronize());

auto end = std::chrono::high_resolution_clock::now();
auto elapsed_seconds
= std::chrono::duration_cast<std::chrono::duration<double>>(end - start);

state.SetIterationTime(elapsed_seconds.count());
}
state.SetBytesProcessed(state.iterations() * output_length * sizeof(ItemT) * Trials);
state.SetItemsProcessed(state.iterations() * output_length * Trials);

HIP_CHECK(hipFree(d_run_items));
HIP_CHECK(hipFree(d_run_offsets));
HIP_CHECK(hipFree(d_output));
}

#define CREATE_BENCHMARK(IT, OT, MINRL, MAXRL, BS, RPT, DIPT) \
benchmark::RegisterBenchmark("block_run_length_decode<Item Type:" #IT ",Offset Type:" #OT \
",Min RunLength:" #MINRL ",Max RunLength:" #MAXRL \
",BlockSize: " #BS ",Runs Per Thread:" #RPT \
",Decoded Items Per Thread:" #DIPT ">", \
&run_benchmark<IT, OT, MINRL, MAXRL, BS, RPT, DIPT>, \
stream, \
size)

int main(int argc, char* argv[])
{
cli::Parser parser(argc, argv);
parser.set_optional<size_t>("size", "size", DEFAULT_N, "number of values");
parser.set_optional<int>("trials", "trials", -1, "number of iterations");
parser.run_and_exit_if_error();

// Parse argv
benchmark::Initialize(&argc, argv);
const size_t size = parser.get<size_t>("size");
const int trials = parser.get<int>("trials");

std::cout << "benchmark_block_run_length_decode" << std::endl;

// HIP
hipStream_t stream = 0; // default
hipDeviceProp_t devProp;
int device_id = 0;
HIP_CHECK(hipGetDevice(&device_id));
HIP_CHECK(hipGetDeviceProperties(&devProp, device_id));
std::cout << "[HIP] Device name: " << devProp.name << std::endl;

// Add benchmarks
std::vector<benchmark::internal::Benchmark*> benchmarks{
CREATE_BENCHMARK(int, int, 1, 5, 128, 2, 4),
CREATE_BENCHMARK(int, int, 1, 10, 128, 2, 4),
CREATE_BENCHMARK(int, int, 1, 50, 128, 2, 4),
CREATE_BENCHMARK(int, int, 1, 100, 128, 2, 4),
CREATE_BENCHMARK(int, int, 1, 500, 128, 2, 4),
CREATE_BENCHMARK(int, int, 1, 1000, 128, 2, 4),
CREATE_BENCHMARK(int, int, 1, 5000, 128, 2, 4),

CREATE_BENCHMARK(double, long long, 1, 5, 128, 2, 4),
CREATE_BENCHMARK(double, long long, 1, 10, 128, 2, 4),
CREATE_BENCHMARK(double, long long, 1, 50, 128, 2, 4),
CREATE_BENCHMARK(double, long long, 1, 100, 128, 2, 4),
CREATE_BENCHMARK(double, long long, 1, 500, 128, 2, 4),
CREATE_BENCHMARK(double, long long, 1, 1000, 128, 2, 4),
CREATE_BENCHMARK(double, long long, 1, 5000, 128, 2, 4)};

// Use manual timing
for(auto& b : benchmarks)
{
b->UseManualTime();
b->Unit(benchmark::kMillisecond);
}

// Force number of iterations
if(trials > 0)
{
for(auto& b : benchmarks)
{
b->Iterations(trials);
}
}

// Run benchmarks
benchmark::RunSpecifiedBenchmarks();
return 0;
}
10 changes: 6 additions & 4 deletions benchmark/benchmark_device_adjacent_difference.parallel.cpp.in
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
// MIT License
//
// Copyright (c) 2022 Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2022-2023 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
Expand All @@ -26,9 +26,11 @@
#include "benchmark_device_adjacent_difference.parallel.hpp"

namespace {
auto benchmarks = config_autotune_register::create<device_adjacent_difference_benchmark<
auto benchmarks = config_autotune_register::create_bulk(
device_adjacent_difference_benchmark_generator<
@DataType@,
@BlockSize@,
@Left@,
@InPlace@,
rocprim::adjacent_difference_config<@BlockSize@, @ItemsPerThread@>>>();
@InPlace@>::create);

}
Loading

0 comments on commit f2347ab

Please sign in to comment.