rocBLAS-2.26.0 for ROCm 3.7.0
New Features
- Improvements to User Guide and Design Document
- L1 dot function optimized to utilize shuffle instructions ( improvements on bf16, f16, f32 data types )
- L1 dot function added x dot x optimized kernel
- Standardization of L1 rocblas-bench to use device pointer mode to focus on GPU memory bandwidth
- Adjustments for hipcc (hip-clang) compiler as standard build compiler and Centos8 support
- Added Fortran interface for all rocBLAS functions
- Improvements to rocblas_Xgemm_batched performance for small m, n, k.
- Improvements to rocblas_Xgemv_batched and rocblas_Xgemv_strided_batched performance for small m (QMCPACK use).
- Improvements to rocblas_Xdot (batched and non-batched) performance when both incx and incy are 1
- Improvements to FP32 ONNX BERT performance for MI50
- Significant improvements to FP32 Resnext, Inception Convolution performance for gfx908
- Slight improvements to FP32 DLRM Terabyte performance for gfx908
- Significant improvements to FP32 BDAS performance for gfx908
- Significant improvements to FP32 BDAS performance for MI50 and MI60
- Added substitution method for small trsm sizes with m <= 64 && n <= 64. Increases performance drastically for small batched trsm.
Known Issues
- None