Develop stream 2024-10-29 (#631)
* remove HIP-CPU support

* Resolve issue: move ROCPRIM_DETAIL_HIP_SYNC_AND_RETURN_ON_ERROR to separate header file

* rebase and add RETURN_ON_ERROR to the header

* Added naive implementation for adjacent_find plus tests and benchmarks

* Improved benchmark by only taking into account relevant processed elements

* Use a faster reduction operation

* Added block-reduction kernel with early exit

* Improved test with random first pair

* Get grid_size for maximum occupancy

* Improved test coverage

* Implement early exit with sequential blocks execution

* Use a dynamic tile_id as in find_first_of for faster stable results

* Added documentation for adjacent_find

* Added tuning for adjacent_find

* Modified tuning so that non-arithmetic types use default configs

* Changed initialization mechanism of kernel's output element

* Fixed tests from review comments

- Simplified adjacent_find_impl functor definition
- Added test for indirect_iterator

* Simplified input transform logic

* Added tuned configs

* Removed duplicated ROCPRIM_DETAIL_HIP_SYNC_AND_RETURN_ON_ERROR

* Resolve "Refactor benchmarks to use a byte-based size"

* Added a rocprim::numeric_limits to support uint128 and int128 and changed all std::numeric_limits to test_utils::numeric_limits

* Create generate_limit to ensure floating point custom types are handled correctly.

* Add rocprim::numeric_limits to numeric_limits_custom_test_type

* Expected output fix block_radix_sort test for custom_test_type<float> and custom_test_type<double>

* Docs fix numeric_limits

* Added numeric_limits to changelog

* Added a rocprim::uint128_t and rocprim::int128_t

* Implemented find_end with tests and benchmark

* Updated find_end benchmark with generate_limits

* Added different input pattern for benchmark and added multiple items per thread

* Added different key_size to tests for find_end

* Added shared memory kernel for find_end

* Changed find_end to search with reverse iterator

* Added tests for different compare function

* Change benchmark to no longer early exit and choosing shared mem kernel as config variable

* Extra check in search kernel to prevent unnecessary global search

* Documentation for find_end

* Changed find_end to make it easier to create search

* Fix docs errors find_end

* Changes for reviews find_end

* Fix rebasing issues find_end

* Added find_end to rocprim header

* Fix build error after adding headers

* Use byte-based size in benchmark

* Remove double defines

* Added search function with tests and benchmark

* Fix documentation find_end and search

* Add device_search to rocprim.hpp header

* add device_ptr utility

Authored-By: Cenxuan Tian <[email protected]>

* replace high_resolution_clock with steady_clock

Authorized-By: Cenxuan Tian <[email protected]>

* properly namespace ROCPRIM_RETURN_ON_ERROR

* Set c++ version to 17 and create warning

* Fix no_discard warning c++17

* Set CI tests to c++14

* Build for both c++ 14 and 17

* Add large sizes test to device_radix_sort

* Added more test coverage segmented_radix_sort

* fix not working with const_iterators

* fix: use bytes instead of size for scan tuning benchmarks

* Resolve "Partial sort optimization: make use of radix sort"

* doc: address the upper bound restrictions on Channels for device_histogram

* doc: explicitly state that ActiveChannels is bounded by Channels

* batch memcpy tests with random seed

* follow clang format

* add newline at the end

* make rocprim::reverse_iterator align with that of std

* minor change

* add constexpr

* adjust format

* add warnings

* adjust format

* change the way of triggering warnings

* adjust format

* minor change

* adjust format

* clear warnings

* adjust format

* correct warning behaviours

* adjust format

* adjust format

* update changelog and fix warning issue

* fix ambiguous issue

* move a CHANGELOG entry to Deprecations section

* feat: add support for predicated flagged device select

* feat: add tests (with large indices) for predicated flagged device select

* feat: add config tuning and benchmarks for predicated and flagged device select

* fix: add missing template parameter to partition-based autotune templates

* Add tuned configs

* Fix clang-format hang

* Fix ambiguous error make_reverse_iterator

* Resolve "Config tuning and dynamic dispatch for device merge"

* add search_n algo

* add test

* Add google test for search_n & tested the functionality

* Add benchmark

* Add Doc & add custom type for benchmark

* Remove unused variables

* Add NonBlockStream support

* Remove unused type alias

* Refactor search_n for loop, edit comments

* Add More tests & Fixed some bugs

* Add more benchmarks

* Add document

* Refactor benchmarks

* Replace another DOXYGEN_DOCUMENTATION_BUILD and some minor modifications

* Fix build debug error

* Optimize algo with large input

* add impl2

* Optimize

* Move hipMalloc vars to temp_memory

* Rewrite benchmarks

* Resolve

* Fix bugs -- several occurrences of consecutive full blocks

* Many modifications, fixed the bugs and edited the tests and benchmarks

* Optimised the block_search_n_kernel

* 2nd version search_n implementation for large input

* Add thread level search_n algorithm

* Add optimizations

* Edit benchmarks

* remove unused variables

* remove unused variables and remove __restrict__

* fix the bug on windows

* fix bug and modify benchmarks and tests

* fix bugs in benchmarks and search_n_impl

* Oh yes

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Beatriz Navidad Vilches <[email protected]>

* apply some suggestions

* edit doc

* replace search_n_min_kernal by rocprim::reduce

* fixed some benchmarks bugs

* remove graph support

* resolve not compile on win

* Add graph support and modified the design a little

* resolve test fail on windows

* fix gfx960 benchmark dead lock

* Add device_search_n to rocprim.hpp

* replace HIP_CHECK by ROCPRIM_RETURN_ON_ERROR

* fix: fix doxygen error due to __launch_bounds__ macro

* Implement 6.3 hotfixes for added/modified tests

* Workaround CI memory usage limit

* Reduce memory usage even more

---------

Co-authored-by: Robin Voetter <[email protected]>
Co-authored-by: Cenxuan Tian <[email protected]>
Co-authored-by: Milo Lurati <[email protected]>
Co-authored-by: Nick Breed <[email protected]>
Co-authored-by: Bence Parajdi <[email protected]>
Co-authored-by: Yung-sheng Tu <[email protected]>
7 people authored Nov 20, 2024
1 parent 22a23f8 commit 4008834
Showing 207 changed files with 15,989 additions and 2,572 deletions.
11 changes: 11 additions & 0 deletions .gitlab-ci.yml
@@ -114,6 +114,7 @@ copyright-date:
-D AMDGPU_TEST_TARGETS=$GPU_TARGETS
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CXX_STANDARD=14
-S $CI_PROJECT_DIR
-B $BUILD_DIR
- cmake
@@ -182,6 +183,7 @@ build:cmake-minimum-apt:
-D BUILD_EXAMPLE=ON
-D GPU_TARGETS=$GPU_TARGETS
-D AMDGPU_TEST_TARGETS=$GPU_TARGETS
-D CMAKE_CXX_STANDARD="$BUILD_VERSION"
-S $CI_PROJECT_DIR
-B $BUILD_DIR
- cmake --build $BUILD_DIR
@@ -210,6 +212,7 @@ build:cmake-latest:
matrix:
- BUILD_TYPE: Release
BUILD_TARGET: [BENCHMARK, TEST]
BUILD_VERSION: [14, 17]

build:cmake-minimum:
needs: []
@@ -220,6 +223,7 @@ build:cmake-minimum:
matrix:
- BUILD_TYPE: [Debug, Release]
BUILD_TARGET: [BENCHMARK, TEST]
BUILD_VERSION: 14

build:package:
stage: build
@@ -236,6 +240,7 @@ build:package:
-G Ninja
-D CMAKE_CXX_COMPILER="$AMDCLANG"
-D CMAKE_BUILD_TYPE=Release
-D CMAKE_CXX_STANDARD=14
-B $PACKAGE_DIR
-S $CI_PROJECT_DIR
- cd $PACKAGE_DIR
@@ -268,6 +273,7 @@ build:windows:
-D CMAKE_CXX_COMPILER:PATH="${env:HIP_PATH}\bin\clang++.exe"
-D CMAKE_PREFIX_PATH:PATH="${env:HIP_PATH}"
-D CMAKE_BUILD_TYPE="$BUILD_TYPE"
-D CMAKE_CXX_STANDARD=14
- cmake --build "$CI_PROJECT_DIR/build"
artifacts:
paths:
@@ -314,6 +320,7 @@ autotune:build:
-D GPU_TARGETS=$GPU_TARGETS
-D CMAKE_C_COMPILER_LAUNCHER=phc_sccache_c
-D CMAKE_CXX_COMPILER_LAUNCHER=phc_sccache_cxx
-D CMAKE_CXX_STANDARD=14
- cmake --build . --target $BENCHMARK_TARGETS
- 'rm -rf $BUILD_DIR/benchmark/benchmark*.parallel'
# The autotune benchmarks get very large, above GitLabs upload limit. Fortunately they compress well.
@@ -339,6 +346,7 @@ test:
matrix:
- BUILD_TYPE: Release
BUILD_TARGET: TEST
BUILD_VERSION: 14
script:
- cd $BUILD_DIR
- cmake
@@ -395,6 +403,7 @@ test-windows-release:
-D CMAKE_CXX_COMPILER="$AMDCLANG"
-D CMAKE_BUILD_TYPE=Release
-D GPU_TARGETS=$GPU_TARGETS
-D CMAKE_CXX_STANDARD=14
-S "$CI_PROJECT_DIR/test/extra"
-B "$CI_PROJECT_DIR/package_test"
- cmake --build "$CI_PROJECT_DIR/package_test"
@@ -416,6 +425,7 @@ test:install:
-G Ninja
-D CMAKE_CXX_COMPILER="$AMDCLANG"
-D CMAKE_BUILD_TYPE=Release
-D CMAKE_CXX_STANDARD=14
-B build
-S $CI_PROJECT_DIR
- $SUDO_CMD cmake --build build --target install
@@ -458,6 +468,7 @@ benchmark:
matrix:
- BUILD_TYPE: Release
BUILD_TARGET: BENCHMARK
BUILD_VERSION: 14
extends:
- .cmake-minimum
- .gpus:rocm
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,13 @@ Full documentation for rocPRIM is available at [https://rocm.docs.amd.com/projec
* Added `--emulation` option added for `rtest.py`
* Unit tests can be run with `[--emulation|-e|--test|-t]=<test_name>`
* Added tuned configurations for segmented radix sort for gfx942 to improve performance on this architecture.
* Added a parallel device-level function, `rocprim::adjacent_find`, similar to the C++ Standard Library `std::adjacent_find` algorithm.
* Added configuration autotuning to device adjacent find (`rocprim::adjacent_find`) for improved performance on selected architectures.
* Added rocprim::numeric_limits which is an extension of `std::numeric_limits`, which includes support for 128-bit integers.
* Added rocprim::int128_t and rocprim::uint128_t which are the __int128_t and __uint128_t types.
* Added the parallel `search` and `find_end` device functions similar to `std::search` and `std::find_end`, these functions search for the first and last occurrence of the sequence respectively.
* Added a parallel device-level function, `rocprim::search_n`, similar to the C++ Standard Library `std::search_n` algorithm.
* Added new constructors and a `base` function, and added `constexpr` specifier to all functions in `rocprim::reverse_iterator` to improve parity with the C++17 `std::reverse_iterator`.
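
The entries above describe the new device functions by analogy with their C++ Standard Library counterparts. As a rough host-side reference for what each one is expected to locate, the sketch below calls only the `std::` algorithms named in the changelog. It does not use the rocPRIM API, whose signatures are not shown in this diff; the `search_results` struct and the `reference_search` helper are hypothetical names introduced purely for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Host-side reference semantics only: these are std:: calls, not rocPRIM calls.
// Every index equals input.size() when nothing is found.
struct search_results // hypothetical helper type, not part of rocPRIM
{
    std::ptrdiff_t adjacent; // first element that equals its successor
    std::ptrdiff_t first;    // start of the first occurrence of the key sequence
    std::ptrdiff_t last;     // start of the last occurrence of the key sequence
    std::ptrdiff_t run;      // start of the first run of `count` copies of `value`
};

inline search_results reference_search(const std::vector<int>& input,
                                       const std::vector<int>& key,
                                       std::ptrdiff_t          count,
                                       int                     value)
{
    const auto begin = input.begin();
    const auto end   = input.end();

    search_results r;
    r.adjacent = std::adjacent_find(begin, end) - begin;
    r.first    = std::search(begin, end, key.begin(), key.end()) - begin;
    r.last     = std::find_end(begin, end, key.begin(), key.end()) - begin;
    r.run      = std::search_n(begin, end, count, value) - begin;
    return r;
}
```

For the input `{1, 2, 3, 3, 1, 2, 3, 3, 3}` with key `{1, 2}`, `count = 3` and `value = 3`, the helper reports indices 2, 0, 4 and 6 respectively. A host-side reference of this kind is one plausible way to check device results in a test.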

### Changed

@@ -22,6 +29,9 @@ Full documentation for rocPRIM is available at [https://rocm.docs.amd.com/projec
* Changed the internal algorithm of block radix sort to use rank match to improve performance of various radix sort related algorithms.
* Disabled padding in various cases where higher occupancy resulted in better performance despite more bank conflicts.

* Removed HIP-CPU support. HIP-CPU support was experimental and broken.
* Changed the C++ version from 14 to 17. C++14 will be deprecated in the next major release.

### Resolved issues

* Fixed an issue where `rmake.py` would generate wrong CMAKE commands while using Linux environment
@@ -30,6 +40,7 @@ Full documentation for rocPRIM is available at [https://rocm.docs.amd.com/projec
* Fixed compilation issue when `rocprim::radix_key_codec<...>` is specialized with a 128-bit integer.

### Upcoming changes
* Using the initialisation constructor of `rocprim::reverse_iterator` will throw a deprecation warning. It will be marked as explicit in the next major release.
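
Since the stated goal is parity with `std::reverse_iterator`, the standard type gives a hint of what code may need to look like once the constructor becomes explicit. The sketch below uses only `std::reverse_iterator` and `std::make_reverse_iterator`; it assumes, without confirmation from this diff, that `rocprim::reverse_iterator` will accept the same direct-initialization form. The two helper functions are hypothetical.

```cpp
#include <iterator>
#include <vector>

// With an explicit constructor, copy-initialization such as
//   std::reverse_iterator<It> r = it;
// does not compile; direct-initialization or make_reverse_iterator is needed.

// Hypothetical helper: reads the last element through a reverse iterator.
// Precondition: v is not empty.
inline int last_element_via_reverse(const std::vector<int>& v)
{
    auto rit = std::make_reverse_iterator(v.end()); // available since C++14
    return *rit; // dereferences the element just before v.end()
}

// Hypothetical helper: base() returns the wrapped forward iterator unchanged.
inline bool base_round_trips(const std::vector<int>& v)
{
    std::reverse_iterator<std::vector<int>::const_iterator> rit(v.end());
    return rit.base() == v.end();
}
```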

## rocPRIM 3.3.0 for ROCm 6.3.0

77 changes: 39 additions & 38 deletions CMakeLists.txt
@@ -48,7 +48,6 @@ option(BUILD_BENCHMARK "Build benchmarks" OFF)
option(BUILD_NAIVE_BENCHMARK "Build naive benchmarks" OFF)
option(BUILD_EXAMPLE "Build examples" OFF)
option(BUILD_DOCS "Build documentation (requires sphinx)" OFF)
option(USE_HIP_CPU "Prefer HIP-CPU runtime instead of HW acceleration" OFF)
# Disables building tests, benchmarks, examples
option(ONLY_INSTALL "Only install" OFF)
option(BUILD_CODE_COVERAGE "Build with code coverage enabled" OFF)
@@ -70,50 +69,57 @@ endif()
set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE CACHE BOOL "Add paths to linker search and installed rpath")

# Set CXX flags
set(CMAKE_CXX_STANDARD 14)
if (NOT DEFINED CMAKE_CXX_STANDARD)
set(CMAKE_CXX_STANDARD 17)
endif()
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

if (CMAKE_CXX_STANDARD EQUAL 14)
message(WARNING "C++14 will be deprecated in the next major release")
elseif(NOT CMAKE_CXX_STANDARD EQUAL 17)
message(FATAL_ERROR "Only C++14 and C++17 are supported")
endif()

if(DEFINED BUILD_SHARED_LIBS)
set(PKG_BUILD_SHARED_LIBS ${BUILD_SHARED_LIBS})
else()
else()
set(PKG_BUILD_SHARED_LIBS ON)
endif()
endif()
set(BUILD_SHARED_LIBS OFF) # don't build client dependencies as shared
if(NOT USE_HIP_CPU)
# Get dependencies (required here to get rocm-cmake)
include(cmake/Dependencies.cmake)
# Use target ID syntax if supported for GPU_TARGETS
if (NOT DEFINED AMDGPU_TARGETS)
set(GPU_TARGETS "all" CACHE STRING "GPU architectures to compile for")

# Get dependencies (required here to get rocm-cmake)
include(cmake/Dependencies.cmake)
# Use target ID syntax if supported for GPU_TARGETS
if (NOT DEFINED AMDGPU_TARGETS)
set(GPU_TARGETS "all" CACHE STRING "GPU architectures to compile for")
else()
set(GPU_TARGETS "${AMDGPU_TARGETS}" CACHE STRING "GPU architectures to compile for")
endif()
set_property(CACHE GPU_TARGETS PROPERTY STRINGS "all")

if(GPU_TARGETS STREQUAL "all")
if(BUILD_ADDRESS_SANITIZER)
# ASAN builds require xnack
rocm_check_target_ids(DEFAULT_AMDGPU_TARGETS
TARGETS "gfx908:xnack+;gfx90a:xnack+;gfx940:xnack+;gfx941:xnack+;gfx942:xnack+"
)
else()
set(GPU_TARGETS "${AMDGPU_TARGETS}" CACHE STRING "GPU architectures to compile for")
endif()
set_property(CACHE GPU_TARGETS PROPERTY STRINGS "all")

if(GPU_TARGETS STREQUAL "all")
if(BUILD_ADDRESS_SANITIZER)
# ASAN builds require xnack
rocm_check_target_ids(DEFAULT_AMDGPU_TARGETS
TARGETS "gfx908:xnack+;gfx90a:xnack+;gfx940:xnack+;gfx941:xnack+;gfx942:xnack+"
)
else()
rocm_check_target_ids(DEFAULT_AMDGPU_TARGETS
TARGETS "gfx803;gfx900:xnack-;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack-;gfx90a:xnack+;gfx940;gfx941;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1151;gfx1200;gfx1201"
)
endif()

set(GPU_TARGETS "${DEFAULT_AMDGPU_TARGETS}" CACHE STRING "GPU architectures to compile for" FORCE)
rocm_check_target_ids(DEFAULT_AMDGPU_TARGETS
TARGETS "gfx803;gfx900:xnack-;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack-;gfx90a:xnack+;gfx940;gfx941;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1151;gfx1200;gfx1201"
)
endif()

# TODO: Fix VerifyCompiler for HIP on Windows
if (NOT WIN32)
include(cmake/VerifyCompiler.cmake)
endif()
list(APPEND CMAKE_PREFIX_PATH ${ROCM_PATH} ${ROCM_PATH}/hip ${ROCM_PATH}/llvm ${ROCM_ROOT}/llvm ${ROCM_ROOT} ${ROCM_ROOT}/hip)
find_package(hip REQUIRED CONFIG PATHS ${HIP_DIR} ${ROCM_PATH} /opt/rocm)
set(GPU_TARGETS "${DEFAULT_AMDGPU_TARGETS}" CACHE STRING "GPU architectures to compile for" FORCE)
endif()

# TODO: Fix VerifyCompiler for HIP on Windows
if (NOT WIN32)
include(cmake/VerifyCompiler.cmake)
endif()
list(APPEND CMAKE_PREFIX_PATH ${ROCM_PATH} ${ROCM_PATH}/hip ${ROCM_PATH}/llvm ${ROCM_ROOT}/llvm ${ROCM_ROOT} ${ROCM_ROOT}/hip)
find_package(hip REQUIRED CONFIG PATHS ${HIP_DIR} ${ROCM_PATH} /opt/rocm)

# FOR HANDLING ENABLE/DISABLE OPTIONAL BACKWARD COMPATIBILITY for FILE/FOLDER REORG
option(BUILD_FILE_REORG_BACKWARD_COMPATIBILITY "Build with file/folder reorg with backward compatibility enabled" OFF)
if(ROCPRIM_INSTALL AND BUILD_FILE_REORG_BACKWARD_COMPATIBILITY AND NOT WIN32)
@@ -130,11 +136,6 @@ if(BUILD_CODE_COVERAGE)
add_link_options(--coverage)
endif()

if(USE_HIP_CPU)
# Get dependencies
include(cmake/Dependencies.cmake)
endif()

# Setup VERSION
set(VERSION_STRING "3.3.0")
rocm_setup_version(VERSION ${VERSION_STRING})
7 changes: 1 addition & 6 deletions README.md
@@ -14,7 +14,7 @@ develop performant GPU-accelerated code on AMD ROCm platforms.
* Including
[HIP-clang](https://github.com/ROCm/HIP/blob/master/INSTALL.md#hip-clang)
compiler
* C++14
* C++17
* Python 3.6 or higher (HIP on Windows only, required only for install script)
* Visual Studio 2019 with Clang support (HIP on Windows only)
* Strawberry Perl (HIP on Windows only)
@@ -110,11 +110,6 @@ You can build and install rocPRIM on Linux or Windows.
# before 'cmake' or setting cmake option 'CMAKE_CXX_COMPILER' to path to the compiler.
# Using HIP-clang:
[CXX=hipcc] cmake -DBUILD_BENCHMARK=ON ../.
#
# ! EXPERIMENTAL !
# Alternatively one may build using the experimental (and highly incomplete) HIP-CPU back-end for host-side
# execution using any C++17 conforming compiler (supported by HIP-CPU). AMDGPU_* options are unavailable in this case.
# USE_HIP_CPU - OFF by default

# Build
make -j4
26 changes: 8 additions & 18 deletions benchmark/CMakeLists.txt
@@ -77,24 +77,10 @@ function(add_rocprim_benchmark BENCHMARK_SOURCE)
rocprim
benchmark::benchmark
)
if(NOT USE_HIP_CPU)
target_link_libraries(${BENCHMARK_TARGET}
PRIVATE
rocprim_hip
)
else()
target_link_libraries(${BENCHMARK_TARGET}
PRIVATE
Threads::Threads
hip_cpu_rt::hip_cpu_rt
)
if(STL_DEPENDS_ON_TBB)
target_link_libraries(${BENCHMARK_TARGET}
PRIVATE
TBB::tbb
)
endif()
endif()
target_link_libraries(${BENCHMARK_TARGET}
PRIVATE
rocprim_hip
)

target_compile_options(${BENCHMARK_TARGET}
PRIVATE
@@ -143,9 +129,11 @@ add_rocprim_benchmark(benchmark_block_scan.cpp)
add_rocprim_benchmark(benchmark_block_sort.cpp)
add_rocprim_benchmark(benchmark_config_dispatch.cpp)
add_rocprim_benchmark(benchmark_device_adjacent_difference.cpp)
add_rocprim_benchmark(benchmark_device_adjacent_find.cpp)
add_rocprim_benchmark(benchmark_device_batch_memcpy.cpp)
add_rocprim_benchmark(benchmark_device_binary_search.cpp)
add_rocprim_benchmark(benchmark_device_find_first_of.cpp)
add_rocprim_benchmark(benchmark_device_find_end.cpp)
add_rocprim_benchmark(benchmark_device_histogram.cpp)
add_rocprim_benchmark(benchmark_device_merge.cpp)
add_rocprim_benchmark(benchmark_device_merge_sort.cpp)
@@ -165,7 +153,9 @@ add_rocprim_benchmark(benchmark_device_run_length_encode.cpp)
add_rocprim_benchmark(benchmark_device_scan.cpp)
add_rocprim_benchmark(benchmark_device_scan_deterministic.cpp)
add_rocprim_benchmark(benchmark_device_scan_by_key.cpp)
add_rocprim_benchmark(benchmark_device_search.cpp)
add_rocprim_benchmark(benchmark_device_scan_by_key_deterministic.cpp)
add_rocprim_benchmark(benchmark_device_search_n.cpp)
add_rocprim_benchmark(benchmark_device_select.cpp)
add_rocprim_benchmark(benchmark_device_segmented_radix_sort_keys.cpp)
add_rocprim_benchmark(benchmark_device_segmented_radix_sort_pairs.cpp)
8 changes: 8 additions & 0 deletions benchmark/ConfigAutotuneSettings.cmake
@@ -33,6 +33,10 @@ function(read_config_autotune_settings file list_across_names list_across output
set(list_across "${TUNING_TYPES};\
true;false true;32 64 128 256 512 1024" PARENT_SCOPE)
set(output_pattern_suffix "@DataType@_@Left@_@InPlace@_@BlockSize@" PARENT_SCOPE)
elseif(file STREQUAL "benchmark_device_adjacent_find")
set(list_across_names "InputType;BlockSize" PARENT_SCOPE)
set(list_across "${TUNING_TYPES};64 128 256 512 1024" PARENT_SCOPE)
set(output_pattern_suffix "@InputType@_@BlockSize@" PARENT_SCOPE)
elseif(file STREQUAL "benchmark_device_histogram")
set(list_across_names "DataType;BlockSize" PARENT_SCOPE)
set(list_across "${TUNING_TYPES};64 128 256" PARENT_SCOPE)
@@ -115,5 +119,9 @@ DataType;BlockSize;" PARENT_SCOPE)
set(list_across_names "DataType;BlockSize" PARENT_SCOPE)
set(list_across "${LIMITED_TUNING_TYPES};32 64 128 256 512 1024" PARENT_SCOPE)
set(output_pattern_suffix "@DataType@_@BlockSize@" PARENT_SCOPE)
elseif(file STREQUAL "benchmark_device_merge")
set(list_across_names "KeyType;ValueType;BlockSize" PARENT_SCOPE)
set(list_across "${TUNING_TYPES};rocprim::empty_type ${LIMITED_TUNING_TYPES};32 64 128 256 512 1024" PARENT_SCOPE)
set(output_pattern_suffix "@KeyType@_@ValueType@_@BlockSize@" PARENT_SCOPE)
endif()
endfunction()