While optimising GRTeclyn on NVIDIA GPUs, we found that at the beginning of each call to the Runge-Kutta function in AMReX, the computational load switched to the CPU, causing a slow-down. We traced this switch to the beginning of the RK4 function (line 244 in AMReX_RungeKutta.H), where a new set of MultiFabs is created to store the RK steps each time the function is called (lines 251-255 in AMReX_RungeKutta.H). We suspect that a speed-up could be achieved if this memory were instead allocated once, at the creation of the amrex::RungeKutta class, and passed as an array of aliased MultiFabs to each RungeKutta sub-routine.
I have included snapshots of the Nsight timeline, displayed using the Nsight Systems viewer, which show the jump in computational load between the CPU and GPU, as well as the corresponding NVTX diagnostic noting the beginning of the RungeKutta4 function. I am happy to share the .nsys-rep file itself, but due to its size I cannot upload it to this issue, so please contact me if you would like a copy.
the-florist changed the title from "Optimisation of Runge-Kutta class" to "Optimisation of Runge-Kutta MultiFab allocation" on Dec 13, 2024
Have you changed the default parameters of arenas such as amrex.the_arena_release_threshold? If not, MultiFab memory allocation is usually a one-time cost. After a few steps, the cost should be very small.