rocBLAS-2.26.0 for ROCm 3.7.0

saadrahim released this 15 Aug 04:26
9d98138

New Features

  • Improvements to User Guide and Design Document
  • L1 dot function optimized to utilize shuffle instructions (improvements on bf16, f16, f32 data types)
  • Added an optimized x dot x kernel to the L1 dot function
  • Standardization of L1 rocblas-bench to use device pointer mode to focus on GPU memory bandwidth
  • Adjustments for hipcc (hip-clang) compiler as standard build compiler and CentOS 8 support
  • Added Fortran interface for all rocBLAS functions
  • Improvements to rocblas_Xgemm_batched performance for small m, n, k.
  • Improvements to rocblas_Xgemv_batched and rocblas_Xgemv_strided_batched performance for small m (QMCPACK use).
  • Improvements to rocblas_Xdot (batched and non-batched) performance when both incx and incy are 1
  • Improvements to FP32 ONNX BERT performance for MI50
  • Significant improvements to FP32 ResNeXt, Inception Convolution performance for gfx908
  • Slight improvements to FP32 DLRM Terabyte performance for gfx908
  • Significant improvements to FP32 BDAS performance for gfx908
  • Significant improvements to FP32 BDAS performance for MI50 and MI60
  • Added a substitution method for small trsm sizes (m <= 64 && n <= 64), which drastically increases performance for small batched trsm.
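
For readers unfamiliar with the substitution method mentioned in the trsm item above, here is a minimal CPU sketch of textbook forward substitution for a lower-triangular system L * X = B. It is an illustration only, not the rocBLAS GPU kernel: the function name `forward_substitution`, the use of `double`, and the restriction to the left-side, lower, non-transposed, non-unit-diagonal case are all assumptions made for this example. Column-major storage with leading dimensions follows the usual BLAS convention.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the rocBLAS API.
// Solves L * X = B by forward substitution and overwrites B with X.
//   L : m x m lower-triangular matrix, non-unit diagonal, column-major, leading dimension lda
//   B : m x n right-hand sides, column-major, leading dimension ldb
void forward_substitution(std::size_t m, std::size_t n,
                          const std::vector<double>& L, std::size_t lda,
                          std::vector<double>& B, std::size_t ldb)
{
    assert(lda >= m && ldb >= m);
    assert(L.size() >= lda * m && B.size() >= ldb * n);

    // Each right-hand-side column is independent of the others.
    for (std::size_t col = 0; col < n; ++col) {
        double* b = B.data() + col * ldb;
        // Row i depends only on the already-solved rows 0..i-1.
        for (std::size_t i = 0; i < m; ++i) {
            double sum = b[i];
            for (std::size_t k = 0; k < i; ++k) {
                sum -= L[i + k * lda] * b[k];
            }
            b[i] = sum / L[i + i * lda];
        }
    }
}
```

Because each column of B is solved independently and the inner loops are short when m and n are small, this formulation avoids the overhead of a blocked algorithm, which is one plausible reason it suits small batched problems.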

Known Issues

  • None