In a recent study on Frontier, a 7-point stencil kernel written with AMDGPU.jl underperforms its HIP counterpart, reaching about half the bandwidth (~300 GB/s vs ~600 GB/s) on a single MI250X. The behavior is reproduced at large scale, up to 4K GPUs. Note that this was a first attempt, using AMDGPU v0.4.
Some to-do items:
Test with AMDGPU.jl v0.5 and onwards
Understand the performance difference with respect to the HIP 7-point stencil driven by a Laplacian operator, available here (see the sketch after this list)
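For context, here is a minimal sketch of the kind of 7-point stencil (Laplacian) kernel being compared, written against AMDGPU.jl's kernel API. The array size, launch configuration, and effective-bandwidth estimate are illustrative assumptions, not values from the Frontier runs, and the `gridsize` convention has varied across AMDGPU.jl versions:

```julia
using AMDGPU

# 7-point Laplacian stencil: each interior point reads its 6 neighbours and itself.
function laplacian7!(C, A, h)
    ix = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    iy = (workgroupIdx().y - 1) * workgroupDim().y + workitemIdx().y
    iz = (workgroupIdx().z - 1) * workgroupDim().z + workitemIdx().z
    if 1 < ix < size(A, 1) && 1 < iy < size(A, 2) && 1 < iz < size(A, 3)
        @inbounds C[ix, iy, iz] = (A[ix-1, iy, iz] + A[ix+1, iy, iz] +
                                   A[ix, iy-1, iz] + A[ix, iy+1, iz] +
                                   A[ix, iy, iz-1] + A[ix, iy, iz+1] -
                                   6.0 * A[ix, iy, iz]) / h^2
    end
    return nothing
end

nx = ny = nz = 512                     # illustrative problem size
A = AMDGPU.rand(Float64, nx, ny, nz)
C = AMDGPU.zeros(Float64, nx, ny, nz)
threads = (128, 2, 1)
groups  = cld.((nx, ny, nz), threads)  # workgroup count (recent AMDGPU.jl convention)

# Warm-up launch: the first call includes kernel compilation.
@roc groupsize=threads gridsize=groups laplacian7!(C, A, 1.0)
AMDGPU.synchronize()

t = @elapsed begin
    @roc groupsize=threads gridsize=groups laplacian7!(C, A, 1.0)
    AMDGPU.synchronize()
end

# Effective bandwidth counting one read of A and one write of C (a lower bound on traffic).
T_eff = 2 * sizeof(Float64) * nx * ny * nz / t / 1e9
println("T_eff ≈ $(round(T_eff, digits=1)) GB/s")
```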
That's a really good point, thanks for getting this started. FYI, running AMDGPU on LUMI with AMDGPU v0.7.4 does not show a major performance difference between HIP C++ and AMDGPU. Part of the story could be that we switched from HSA to HIP internally in AMDGPU, and that @pxl-th did massive refactoring work (TY).
While developing FastIce (https://github.com/PTsolvers/FastIce.jl), which should run optimally on LUMI, we encountered quite a few challenges using the Julia GPU stack, which now favours Julia task-based parallelism instead of events as it used to. @utkinis thus started a project called HPCBenchmarks (https://github.com/PTsolvers/HPCBenchmarks.jl), where we compare host overhead, memcpy, and 2D and 3D Laplacian kernels for CUDA.jl and AMDGPU.jl against their respective C++ CUDA and HIP counterparts. The benchmarks are designed to populate a BenchmarkGroup matrix (from BenchmarkTools.jl), which can then be used for further analysis.
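For illustration, a minimal sketch of how such a BenchmarkGroup can be populated and aggregated with BenchmarkTools.jl; the group keys and the host-side memcpy entry here are hypothetical placeholders, not the actual layout used in HPCBenchmarks.jl:

```julia
using BenchmarkTools

# Hypothetical group layout: one sub-group per benchmark, one entry per backend.
suite = BenchmarkGroup()
suite["memcpy"]      = BenchmarkGroup()
suite["laplacian2d"] = BenchmarkGroup()

# Host-side placeholder entry; the real suite adds CUDA.jl / AMDGPU.jl / C++ entries.
suite["memcpy"]["host"] = @benchmarkable copyto!(dst, src) setup=(src = rand(Float32, 2^20); dst = similar(src))

results = run(suite; verbose = true)  # nested BenchmarkGroup of trials
minimum(results)                      # aggregate per entry, e.g. for cross-backend comparison
```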
I just updated the suite to make sure it runs on AMDGPU and CUDA. Only host-overhead needs to be fixed.
Suggestion: maybe one could add this repo to the JuliaGPU org and extend it with relevant benchmarks so that it runs as part of GPU CI, while also allowing people to pull it and run it on HPC centre CI. Further additions could follow.
Opening this after discussion in the HPC call with @vchuravy and @gbaraldi.