pranjalssh/fast.cu

Fastest GPU kernels, written from scratch.

Matrix Multiplication

Matrix multiplication of square bf16 matrices, accumulated in fp32.
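To make "bf16 inputs, fp32 accumulation" concrete, here is a minimal CPU reference sketch. It is not code from this repository: the names `round_to_bf16` and `matmul_reference` are hypothetical, the row-major layout is an assumption, and NaN handling is omitted. It can serve as a slow correctness check for small N.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Round an fp32 value to the nearest bf16 (ties to even) and widen it back
// to fp32. bf16 keeps the top 16 bits of the fp32 encoding.
// Assumption: inputs are finite; NaN/Inf are not treated specially.
float round_to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u);  // round to nearest, ties to even
    bits &= 0xFFFF0000u;                    // drop the low 16 mantissa bits
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}

// C = A * B for row-major n x n matrices (layout is an assumption).
// Inputs are rounded to bf16; products and the running sum stay in fp32.
std::vector<float> matmul_reference(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;  // fp32 accumulator
            for (std::size_t k = 0; k < n; ++k) {
                acc += round_to_bf16(a[i * n + k]) * round_to_bf16(b[k * n + j]);
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}
```

The reference is O(N^3) on a single thread, so it is only practical for small matrices; a GPU result can be compared against it with a tolerance, since the summation order on the GPU will differ.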

N=4096
Kernel: 763 TFLOP/s
cuBLAS: 716 TFLOP/s

N=8192
Kernel: 808 TFLOP/s
cuBLAS: 795 TFLOP/s
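The throughput figures above can be related to wall-clock time with a short sketch. It assumes the common convention of counting 2·N^3 floating-point operations for an N×N×N multiply (one multiply and one add per inner-product term); the README does not state which convention the benchmark uses, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations for an n x n x n matmul under the 2*n^3
// convention (assumption): n*n outputs, each needing n multiplies and n adds.
double matmul_flops(std::int64_t n) {
    assert(n > 0);
    const double d = static_cast<double>(n);
    return 2.0 * d * d * d;
}

// Achieved throughput in TFLOP/s (1 TFLOP = 1e12 operations) for a
// measured kernel time in seconds.
double matmul_tflops(std::int64_t n, double seconds) {
    assert(seconds > 0.0);
    return matmul_flops(n) / seconds / 1e12;
}
```

Under this convention, N=4096 is 2^37 (about 1.37e11) operations, so 763 TFLOP/s corresponds to roughly 0.18 ms per multiply; N=8192 is 2^40 operations, or roughly 1.36 ms at 808 TFLOP/s.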

For an explanation, see https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog

To run:
make matmul && out/matmul

Example kernels are in examples/matmul/; the orchestration code is in matmul.cu.

Sum reduction

We compute the sum of 2^30 elements.

To run:
make sum && out/sum

Kernel: 3240.11 GB/s
CUB library: 3193 GB/s
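A reduction reads each input element once, so its figure of merit is memory bandwidth rather than FLOP/s. The sketch below shows one way such a number can be derived from a measured time. The function name is hypothetical, 1 GB is taken as 1e9 bytes, and the element width is a parameter because the README does not state the element type.

```cpp
#include <cassert>
#include <cstdint>

// Effective read bandwidth in GB/s (1 GB = 1e9 bytes, an assumption) for a
// reduction that reads `count` elements of `bytes_per_element` bytes each
// in `seconds`. The single output value is ignored as negligible.
double reduction_gb_per_s(std::int64_t count,
                          std::int64_t bytes_per_element,
                          double seconds) {
    assert(count > 0 && bytes_per_element > 0 && seconds > 0.0);
    const double bytes = static_cast<double>(count) *
                         static_cast<double>(bytes_per_element);
    return bytes / seconds / 1e9;
}
```

For example, if the elements were 4 bytes wide (an assumption), 2^30 elements would be about 4.29e9 bytes, and 3240.11 GB/s would correspond to roughly 1.33 ms per reduction.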

Example kernels are in sum.cu.
