Releases: ROCm/rocBLAS
rocBLAS 3.1.0 for ROCm 5.7.1
rocBLAS code for ROCm 5.7.1 did not change. The library was rebuilt for the updated ROCm 5.7.1 stack.
rocBLAS 3.1.0 for ROCm 5.7.0
Added
- yaml lock step argument scanning for rocblas-bench and rocblas-test clients. See Programmers Guide for details.
- rocblas-gemm-tune is used to find the best performing GEMM kernel for each of a given set of GEMM problems.
Fixed
- make offset calculations for rocBLAS functions 64 bit safe. Fixes for very large leading dimensions or increments potentially causing overflow:
  - Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin, nrm2
  - Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, trsv
  - Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, geam
  - General: set_vector, get_vector, set_matrix, get_matrix
  - Related fixes: internal scalar loads with > 32bit offsets
- fix in-place functionality for all trtri sizes
Changed
- dot when using rocblas_pointer_mode_host is now synchronous to match legacy BLAS as it stores results in host memory
- enhanced reporting of installation issues caused by runtime libraries (Tensile)
- standardized internal rocblas C++ interface across most functions
Deprecated
- Removal of __STDC_WANT_IEC_60559_TYPES_EXT__ define in future release
Dependencies
- optional use of AOCL BLIS 4.0 on Linux for clients
- optional build tool only dependency on python psutil
rocBLAS 3.0.0 for ROCm 5.6.1
rocBLAS code for ROCm 5.6.1 did not change. The library was rebuilt for the updated ROCm 5.6.1 stack.
rocBLAS 3.0.0 for ROCm 5.6.0
Optimizations
- Improved performance of Level 2 rocBLAS GEMV on gfx90a GPU for non-transposed problems having small matrices and larger batch counts. Performance enhanced for problem sizes when m and n <= 32 and batch_count >= 256.
- Improved performance of rocBLAS syr2k for single, double, and double-complex precision, and her2k for double-complex precision. Slightly improved performance for general sizes on gfx90a.
Added
- Added bf16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex.
Deprecated
- in-place trmm is deprecated. It will be replaced by a trmm that has both in-place and out-of-place functionality
- rocblas_query_int8_layout_flag() is deprecated and will be removed in a future release
- rocblas_gemm_flags_pack_int8x4 enum is deprecated and will be removed in a future release
- rocblas_set_device_memory_size() is deprecated and will be replaced by a future function rocblas_increase_device_memory_size()
- rocblas_is_user_managing_device_memory() is deprecated and will be removed in a future release
Removed
- is_complex helper was deprecated and is now removed. Use rocblas_is_complex instead.
- The enum truncate_t and the value truncate were deprecated and are now removed. They were replaced by rocblas_truncate_t and rocblas_truncate, respectively.
- rocblas_set_int8_type_for_hipblas was deprecated and is now removed.
- rocblas_get_int8_type_for_hipblas was deprecated and is now removed.
Dependencies
- build only dependency on python joblib added as used by Tensile build
- fix for cmake install on some OS when performed by install.sh -d --cmake_install
Fixed
- make trsm offset calculations 64 bit safe
Changed
- refactor rotg test code
rocBLAS 2.47.0 for ROCm 5.5.1
rocBLAS code for ROCm 5.5.1 did not change. The library was rebuilt for the updated ROCm 5.5.1 stack.
rocBLAS 2.47.0 for ROCm 5.5.0
Added
- added functionality rocblas_geam_ex for matrix-matrix minimum operations
- added HIP Graph support as beta feature for rocBLAS Level 1, Level 2, and Level 3 (pointer mode host) functions
- added beta features API. Exposed using compiler define ROCBLAS_BETA_FEATURES_API
- added support for vector initialization in the rocBLAS test framework with negative increments
- added Windows build documentation for forthcoming support using ROCm HIP SDK
- added scripts to plot performance for multiple functions
Optimizations
- improved performance of Level 2 rocBLAS GEMV for float and double precision. Performance enhanced by 150-200% for certain problem sizes when (m==n) measured on a gfx90a GPU.
- improved performance of Level 2 rocBLAS GER for float, double and complex float precisions. Performance enhanced by 5-7% for certain problem sizes measured on a gfx90a GPU.
- improved performance of Level 2 rocBLAS SYMV for float and double precisions. Performance enhanced by 120-150% for certain problem sizes measured on both gfx908 and gfx90a GPUs.
Fixed
- fixed setting of executable mode on client script rocblas_gentest.py to avoid potential permission errors with clients rocblas-test and rocblas-bench
- fixed deprecated API compatibility with Visual Studio compiler
- fixed test framework memory exception handling for Level 2 functions when the host memory allocation exceeds the available memory
Changed
- install.sh internally runs rmake.py (also used on Windows), and rmake.py may be used directly by developers on Linux (use --help)
- rocblas client executables all now begin with rocblas- prefix
Removed
- install.sh options -o --cov removed, as Tensile now uses the default COV format, set by cmake define Tensile_CODE_OBJECT_VERSION=default
rocBLAS 2.46.0 for ROCm 5.4.4
rocBLAS code for ROCm 5.4.4 did not change. The library was rebuilt for the updated ROCm 5.4.4 stack.
rocBLAS 2.46.0 for ROCm 5.4.3
rocBLAS code for ROCm 5.4.3 did not change. The library was rebuilt for the updated ROCm 5.4.3 stack.
rocBLAS 2.46.0 for ROCm 5.4.2
rocBLAS code for ROCm 5.4.2 did not change. The library was rebuilt for the updated ROCm 5.4.2 stack.
rocBLAS 2.46.0 for ROCm 5.4.1
rocBLAS code for ROCm 5.4.1 did not change. The library was rebuilt for the updated ROCm 5.4.1 stack.