In a recent study on Frontier, a 7-point stencil kernel written with AMDGPU.jl underperforms its HIP counterpart, reaching about half the bandwidth (~300 GB/s vs ~600 GB/s) on a single MI250X. The behavior is reproduced at large scale, up to 4K GPUs. Note that this was a first attempt, using AMDGPU v0.4.
Some to-do items:
Test with AMDGPU.jl v0.5 and onwards
Understand the performance difference with respect to the HIP 7-point stencil driven by a Laplacian operator, available here (see the sketch after this list)
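For context, here is a minimal sketch of the kind of 7-point stencil (Laplacian) kernel being compared, written against AMDGPU.jl's kernel API. The array size, launch configuration, and effective-bandwidth estimate are illustrative assumptions, not values from the Frontier runs, and the `gridsize` convention has varied across AMDGPU.jl versions:

```julia
using AMDGPU

# 7-point Laplacian stencil: each interior point reads its 6 neighbours and itself.
function laplacian7!(C, A, h)
    ix = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    iy = (workgroupIdx().y - 1) * workgroupDim().y + workitemIdx().y
    iz = (workgroupIdx().z - 1) * workgroupDim().z + workitemIdx().z
    if 1 < ix < size(A, 1) && 1 < iy < size(A, 2) && 1 < iz < size(A, 3)
        @inbounds C[ix, iy, iz] = (A[ix-1, iy, iz] + A[ix+1, iy, iz] +
                                   A[ix, iy-1, iz] + A[ix, iy+1, iz] +
                                   A[ix, iy, iz-1] + A[ix, iy, iz+1] -
                                   6.0 * A[ix, iy, iz]) / h^2
    end
    return nothing
end

nx = ny = nz = 512                     # illustrative problem size
A = AMDGPU.rand(Float64, nx, ny, nz)
C = AMDGPU.zeros(Float64, nx, ny, nz)
threads = (128, 2, 1)
groups  = cld.((nx, ny, nz), threads)  # workgroup count (recent AMDGPU.jl convention)

# Warm-up launch: the first call includes kernel compilation.
@roc groupsize=threads gridsize=groups laplacian7!(C, A, 1.0)
AMDGPU.synchronize()

t = @elapsed begin
    @roc groupsize=threads gridsize=groups laplacian7!(C, A, 1.0)
    AMDGPU.synchronize()
end

# Effective bandwidth counting one read of A and one write of C (a lower bound on traffic).
T_eff = 2 * sizeof(Float64) * nx * ny * nz / t / 1e9
println("T_eff ≈ $(round(T_eff, digits=1)) GB/s")
```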
That's a really good point, thanks for getting this started. FYI, running AMDGPU on LUMI with AMDGPU v0.7.4 does not show a major performance difference between HIP C++ and AMDGPU. Part of the story could be that we switched from HSA to HIP internally in AMDGPU, and that @pxl-th did massive refactoring work (TY).
While developing FastIce (https://github.com/PTsolvers/FastIce.jl), which should run optimally on LUMI, we encountered quite a few challenges using the Julia GPU stack, which now favours Julia task-based parallelism instead of events as it used to. @utkinis thus started a project called HPCBenchmarks (https://github.com/PTsolvers/HPCBenchmarks.jl), where we compare host overhead, memcpy, and 2D and 3D Laplacian kernels for CUDA.jl and AMDGPU.jl against their respective C++ CUDA and HIP counterparts. The benchmarks are designed to populate a BenchmarkGroup matrix (from BenchmarkTools.jl), which can then be used for further analysis.
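For illustration, a minimal sketch of how such a BenchmarkGroup can be populated and aggregated with BenchmarkTools.jl; the group keys and the host-side memcpy entry here are hypothetical placeholders, not the actual layout used in HPCBenchmarks.jl:

```julia
using BenchmarkTools

# Hypothetical group layout: one sub-group per benchmark, one entry per backend.
suite = BenchmarkGroup()
suite["memcpy"]      = BenchmarkGroup()
suite["laplacian2d"] = BenchmarkGroup()

# Host-side placeholder entry; the real suite adds CUDA.jl / AMDGPU.jl / C++ entries.
suite["memcpy"]["host"] = @benchmarkable copyto!(dst, src) setup=(src = rand(Float32, 2^20); dst = similar(src))

results = run(suite; verbose = true)  # nested BenchmarkGroup of trials
minimum(results)                      # aggregate per entry, e.g. for cross-backend comparison
```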
I just updated the suite to make sure it runs on AMDGPU and CUDA. Only host-overhead needs to be fixed.
Suggestion: maybe one could add this repo to the JuliaGPU org and extend it with relevant benchmarks so that it runs as part of GPU CI, while also allowing people to pull it and run it on HPC centre CI. Further additions could follow.
Opening this after discussion in the HPC call with @vchuravy and @gbaraldi.