
[WIP] Extend MultiRegion to NonhydrostaticModel #2523

Closed
wants to merge 3 commits

Conversation

@simone-silvestri (Collaborator) commented May 5, 2022

Now that MultiRegion is finally merged, we can implement the single-node multi-GPU paradigm in the NonhydrostaticModel as well.

cc @tomchor

The work can be divided into three tasks:

  • Adapt the NonhydrostaticModel to accept a MultiRegionGrid, i.e., wrap local function calls in @apply_regionally and extend global methods in multi_region_models.jl.
  • Expose the parallelism in the RungeKutta3 time stepper and in the update_state! method. This is achieved by lumping local function calls (all possible kernel calls, such as tendency calculations, RK3 substeps, etc.) together in outer functions and wrapping everything in @apply_regionally (a sketch of this pattern is given below the list).
  • Implement a multi-GPU pressure solver. This can be achieved in a couple of different ways: (1) transpose local memory and perform the FFT one direction at a time (as we do now in the Distributed module through PencilArrays); (2) exploit the multi-GPU capabilities of CUDA through the cufftxt library, which can perform single-node distributed FFTs on up to 16 GPUs; (3) allocate the storage and plan in unified memory and perform the FFT on only one GPU. We would implement (3) only if we are desperate. The best solution would be to go with method (2), as (1) incurs hefty memory-transfer costs (I am not sure how cufftxt implements the multi-GPU FFT, though).

The first two tasks are fairly straightforward, so I think the bulk of the work will be implementing the pressure solver.
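For illustration, here is a toy, self-contained sketch of the "lump local calls into one outer function" pattern from the second task. Everything here (the Region struct, compute_tendencies!, substep!, local_calculations!) is a hypothetical stand-in, not Oceananigans API; only @apply_regionally is the real macro referred to above, mimicked here by a plain loop over regions:

# Toy stand-in: each "region" holds its own fields and tendencies.
struct Region
    u  :: Vector{Float64}
    Gu :: Vector{Float64}
end

compute_tendencies!(r::Region) = (r.Gu .= -r.u; nothing)       # stand-in for the tendency kernels
substep!(r::Region, Δt)        = (r.u .+= Δt .* r.Gu; nothing) # stand-in for an RK3 substep

# Lump the local kernel calls into one outer function, so a single
# region-wise dispatch covers all of them:
function local_calculations!(r::Region, Δt)
    compute_tendencies!(r)
    substep!(r, Δt)
    return nothing
end

regions = [Region(rand(4), zeros(4)) for _ in 1:2]
# In Oceananigans this loop would be written `@apply_regionally local_calculations!(model, Δt)`:
foreach(r -> local_calculations!(r, 0.1), regions)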

@simone-silvestri added the labels "experimental feature 🧪 (Because danger is the spice of life)" and "distributed 🕸️ (Our plan for total cluster domination)" on May 5, 2022
@simone-silvestri marked this pull request as draft on May 5, 2022 14:54
@glwagner (Member) commented May 5, 2022

Expose the parallelism in the RungeKutta3 time stepper and in the update_state! method. This is achieved by lumping local function calls (all possible kernel calls, such as tendency calculations, RK3 substeps, etc.) together in outer functions and wrapping everything in @apply_regionally

This is not strictly necessary, right? Just if we also want to support RungeKutta3.

@simone-silvestri (Collaborator, Author)

You're right, but it is quite easy to do.

@glwagner (Member) commented May 5, 2022

Implement a multi-GPU pressure solver. This can be achieved in a couple of different ways: (1) transpose local memory and perform the FFT one direction at a time (as we do now in the Distributed module through PencilArrays); (2) exploit the multi-GPU capabilities of CUDA through the cufftxt library, which can perform single-node distributed FFTs on up to 16 GPUs; (3) allocate the storage and plan in unified memory and perform the FFT on only one GPU. We would implement (3) only if we are desperate. The best solution would be to go with method (2), as (1) incurs hefty memory-transfer costs (I am not sure how cufftxt implements the multi-GPU FFT, though).

I think (but am not 100% sure) that PencilArrays is tied to MPI. So I guess for (1) we are either "borrowing" (but not directly using) the transpose algorithm from PencilArrays, or we are extending the code so that it works "without MPI" -- and also with GPUs, which is work in progress: jipolanco/PencilFFTs.jl#37

Another reason to focus on (2) is that we can use PencilFFTs for the distributed (but not MultiRegion) nonhydrostatic model. So even if PencilFFTs supported CuArray now (and even if Distributed were performant for multi-GPU --- both of which may not be too close), using cufftxt could still be motivated by performance.

In other words, if we develop a capability with cufftxt, then in the future we can also support multi-GPU via PencilFFTs and Distributed, and we will have two options (one perhaps performant on a single node with multiple GPUs, and another for when you need GPUs spread across multiple nodes).

@glwagner (Member) commented May 5, 2022

Here's some description of cufftxt:

https://docs.nvidia.com/cuda/cufft/index.html#multiple-GPU-cufft-transforms

return nothing
end

function local_calculations!(model)
A Member commented on this line:

Suggested change
function local_calculations!(model)
function local_nonhydrostatic_state_update!(model)

Even though this function appears right next to where it's used, it's possible that we encounter it in external contexts (e.g., in an error message). So it's helpful to give it a descriptive name (since it's not much extra effort to do that, we might as well...).

@simone-silvestri (Collaborator, Author) commented May 6, 2022

A new idea for extending the pressure solver to MultiRegion is to divide the FFT computations into "local" and "non-local" directions.

The FFT in local directions can be easily performed by wrapping the transform in @apply_regionally.
For non-local directions, if the storage and plan are unified arrays, the non-local FFT can be performed by permuting the partitioning of the MultiRegionGrid without explicitly transposing memory (that happens under the hood thanks to unified memory).

This strategy would not easily extend to arbitrarily subdivided regions and will only play well with one-direction partitioning, but given the current limitations of the FFT solver (regular grids only), I think it is a good way to get something working.
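As a serial sanity check of this idea (plain FFTW on the CPU, so no MultiRegion, unified memory, or GPUs involved): transforming the "local" directions first and the remaining direction afterwards reproduces the full 3D transform.

using FFTW

A = rand(ComplexF64, 4, 6, 8)   # (x, y, z)

Â = fft(A, (2, 3))   # 1. FFT along the directions local to each region (y and z here)
                     # 2. permute the partition direction; with unified memory the
                     #    data movement between devices would happen under the hood
Â = fft(Â, 1)        # 3. FFT along the remaining (previously non-local) x direction

Â ≈ fft(A)           # true: direction-by-direction transforms compose to the full 3D FFT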

@glwagner (Member) commented May 6, 2022

For non-local directions, if the storage and plan are unified arrays, the non-local FFT can be performed by permuting the partitioning of the MultiRegionGrid without explicitly transposing memory (that happens under the hood thanks to unified memory).

Only storage is an array; plan is a CUFFT object, not an array.

If we use cufftxt, would this happen by default?

I.e., with cufftxt we have to build unified storage (and maybe unified eigenvalues). Then, provided we can fill storage correctly and empty it correctly at the end, all that's left is to "just do" the FFT (transposes etc. handled under the hood).

Does broadcasting work with unified memory arrays?

@simone-silvestri (Collaborator, Author) commented May 6, 2022

Yep, broadcasting works. My thought was that plan can be hacked to store unified memory; I still have to look at the data structure to see how to do it.

cufftxt basically works the same way (the local FFT direction is distributed among the workers, then a transpose, then the non-local FFT). I am not sure whether they use unified memory, but they certainly use transposes:
https://on-demand.gputechconf.com/gtc/2014/presentations/S4788-rapid-multi-gpu-programming-cuda-libraries.pdf
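A quick check of the broadcasting point, under the assumption that a recent CUDA.jl accepts a `unified = true` keyword in `cu` (the exact API for allocating unified memory has changed across CUDA.jl versions, so treat this as a sketch):

using CUDA

a = cu(ones(Float32, 16); unified = true)   # unified-memory CuArray
b = cu(fill(2f0, 16);     unified = true)

c = a .+ b           # broadcasting works on unified-memory arrays
all(Array(c) .== 3)  # true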

@glwagner (Member) commented May 6, 2022

Yep, broadcasting works. My thought was that plan can be hacked to store unified memory; I still have to look at the data structure to see how to do it.

cufftxt basically works the same way (the local FFT direction is distributed among the workers, then a transpose, then the non-local FFT). I am not sure whether they use unified memory, but they certainly use transposes: https://on-demand.gputechconf.com/gtc/2014/presentations/S4788-rapid-multi-gpu-programming-cuda-libraries.pdf

Mmm, ok. Is this proposal basically a way to avoid cufftxt? I think what you outlined is roughly how PencilFFTs works:

  1. FFT along local direction (dim=1)
  2. Simultaneously communicate and permute data to (2, 1, 3) (or is it (2, 3, 1)?)
  3. FFT along local direction (dim=2)
  4. Simultaneously communicate and permute data to (3, 2, 1)
  5. FFT along dim=3.

At the end, the data has permutation (3, 2, 1). The backwards transform then reverses this process. solver.storage is actually a tuple of 3 preallocated arrays to support this algorithm.
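A serial stand-in for this sequence, using plain FFTW: each permutedims below represents a communicate-and-permute step, and the particular permutation tuples are just one consistent choice, not necessarily the ones PencilFFTs uses internally.

using FFTW

A = rand(ComplexF64, 4, 6, 8)        # (x, y, z)

Â = fft(A, 1)                        # 1. FFT along the local direction (x)
Â = permutedims(Â, (2, 1, 3))        # 2. permute so y becomes local: (y, x, z)
Â = fft(Â, 1)                        # 3. FFT along y
Â = permutedims(Â, (3, 1, 2))        # 4. permute so z becomes local: (z, y, x)
Â = fft(Â, 1)                        # 5. FFT along z

Â ≈ permutedims(fft(A), (3, 2, 1))   # true: the full 3D FFT, stored with permutation (z, y, x)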

For the tridiagonal solver I think we want to use the same algorithm, except that we skip step 1 (i.e., the first step is to communicate and permute data with no transform). Once the two other transforms are complete we have data in the configuration (x, y, z) where z is local, and we can do a tridiagonal solve in eigenfunction space. Then we transform back, obtain data back in the (z, y, x) permutation with z local, and copy it into the pressure.

We have to extend the tridiagonal solver to accommodate this kind of permutation for the distributed CPU case, so if we have an algorithm like the one above we can then also use it for MultiRegionGrid solves on the GPU.

@simone-silvestri (Collaborator, Author)

Exactly. In practice a transpose is communication + permutation, but with unified memory we avoid the explicit communication part (I mean, it is still there, but it's handled by CUDA).

@glwagner (Member) commented May 6, 2022

Exactly. In practice a transpose is communication + permutation, but with unified memory we avoid the explicit communication part (I mean, it is still there, but it's handled by CUDA).

So maybe we need three unified memory arrays for solver.storage, each permuted with respect to one another?

I was thinking it'd be nice to avoid coding it ourselves by using cufftxt, but now that we're talking about it, it doesn't seem too difficult.

@navidcy (Collaborator) commented Jul 1, 2023

Is this PR superseded by #2795?

@tomchor (Collaborator) commented Jul 1, 2023

Is this PR superseded by #2795?

I think so. Although somehow this one seems to have fewer conflicts...

@glwagner (Member) commented Jul 5, 2023

Also in my opinion we should implement a distributed GPU nonhydrostatic model first. MultiRegion is a cherry on top for sure, but not essential.

@simone-silvestri (Collaborator, Author)

Yeah, #2795 does supersede this PR, although #2795 has a custom implementation of transpose + FFT, which we might want to avoid now that PencilFFTs supports CuArrays.

@simone-silvestri (Collaborator, Author)

Stale and superseded by #2795; I'll close.
