Optimisation of Runge-Kutta MultiFab allocation #4263

the-florist · 2024-12-13T14:14:07Z

While optimising GRTeclyn on NVIDIA GPUs, we found that at the beginning of each call to the Runge-Kutta function in AMReX, the computational load switched to the CPU, causing a slow-down. We traced this switch to the beginning of the RK4 function (line 244 in AMReX_RungeKutta.H), where a new set of MultiFabs is created to store the intermediate RK stages each time the function is called (lines 251-255 in AMReX_RungeKutta.H). We suspect that a speed-up could be achieved if this memory were instead allocated once, at the creation of the amrex::RungeKutta class, and passed as an array of aliased MultiFabs to each RungeKutta sub-routine.
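To make the proposal concrete, here is a minimal sketch of the "allocate once, reuse on every call" pattern. It deliberately uses plain `std::vector<double>` buffers rather than MultiFabs so that it is self-contained; `RK4Workspace`, `step` and `stage_data` are hypothetical names for illustration and are not part of AMReX.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical workspace (not an AMReX type): the four RK4 stage buffers and
// one scratch buffer are allocated once, in the constructor, and then reused
// by every call to step().
class RK4Workspace {
public:
    explicit RK4Workspace(std::size_t n) : m_tmp(n) {
        for (auto& k : m_k) { k.resize(n); }
    }

    // Advance u in place from t to t + dt with classical RK4.
    // rhs(dudt, u, t) must fill dudt with the right-hand side evaluated at (u, t).
    template <typename F>
    void step(std::vector<double>& u, double t, double dt, F&& rhs) {
        const std::size_t n = u.size();
        assert(n == m_tmp.size()); // workspace must match the state size

        rhs(m_k[0], u, t);
        for (std::size_t i = 0; i < n; ++i) { m_tmp[i] = u[i] + 0.5 * dt * m_k[0][i]; }
        rhs(m_k[1], m_tmp, t + 0.5 * dt);
        for (std::size_t i = 0; i < n; ++i) { m_tmp[i] = u[i] + 0.5 * dt * m_k[1][i]; }
        rhs(m_k[2], m_tmp, t + 0.5 * dt);
        for (std::size_t i = 0; i < n; ++i) { m_tmp[i] = u[i] + dt * m_k[2][i]; }
        rhs(m_k[3], m_tmp, t + dt);
        for (std::size_t i = 0; i < n; ++i) {
            u[i] += dt / 6.0 * (m_k[0][i] + 2.0 * m_k[1][i] + 2.0 * m_k[2][i] + m_k[3][i]);
        }
    }

    // Exposed only so a caller can check that stage storage is not reallocated.
    const double* stage_data(int s) const { return m_k[s].data(); }

private:
    std::array<std::vector<double>, 4> m_k; // stage storage, reused across steps
    std::vector<double> m_tmp;              // scratch state for intermediate stages
};
```

In this sketch the caller constructs one workspace before the time loop and calls `step` repeatedly, so no stage storage is allocated inside the loop. Whether the same change pays off for MultiFabs depends on how expensive the per-call allocation really is in practice.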

I have included snapshots of the Nsight timeline, displayed using the Nsight Systems viewer, which show the jump in computational load between the CPU and GPU, as well as the corresponding NVTX diagnostic noting the beginning of the RungeKutta4 function. I am happy to share the .nsys-rep file itself, but due to its size I cannot upload it to this issue, so please contact me if you would like a copy.

WeiqunZhang · 2024-12-14T18:19:54Z

Have you changed the default parameters of arenas such as amrex.the_arena_release_threshold? If not, MultiFab memory allocation is usually a one-time cost. After a few steps, the cost should be very small.
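As an illustration of why repeated allocation can become a one-time cost, here is a toy caching pool. It is only a sketch of the general idea, not AMReX's arena implementation, and `CachingPool`, `allocate`, `release` and `system_allocs` are hypothetical names: released blocks are kept on a free list and handed back on the next request of the same size, so only the first few requests reach the system allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Toy caching pool (hypothetical, for illustration only).
// Blocks that are released are cached by size and reused later.
class CachingPool {
public:
    CachingPool() = default;
    CachingPool(const CachingPool&) = delete;
    CachingPool& operator=(const CachingPool&) = delete;

    // Frees only the cached blocks; blocks still held by callers are not tracked.
    ~CachingPool() {
        for (auto& entry : m_free) {
            for (void* p : entry.second) { std::free(p); }
        }
    }

    void* allocate(std::size_t nbytes) {
        auto it = m_free.find(nbytes);
        if (it != m_free.end() && !it->second.empty()) {
            void* p = it->second.back(); // cache hit: no system call needed
            it->second.pop_back();
            return p;
        }
        ++m_system_allocs;               // cache miss: fall back to malloc
        return std::malloc(nbytes);
    }

    // Return a block to the cache instead of freeing it.
    void release(void* p, std::size_t nbytes) {
        assert(p != nullptr);
        m_free[nbytes].push_back(p);
    }

    int system_allocs() const { return m_system_allocs; }

private:
    std::map<std::size_t, std::vector<void*>> m_free; // size -> cached blocks
    int m_system_allocs = 0;                          // calls that reached malloc
};
```

If each "time step" allocates two temporaries of the same size and releases them at the end, only the first step reaches `malloc`; later steps are served from the cache. A pool that returned its cached blocks to the system between steps would lose this benefit, which is presumably why the question about changed arena defaults matters here.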

the-florist changed the title ~~Optimisation of Runge-Kutta class~~ Optimisation of Runge-Kutta MultiFab allocation Dec 13, 2024
