
[WIP] Extend MultiRegion to NonhydrostaticModel #2523

Closed
wants to merge 3 commits

Conversation

@simone-silvestri (Collaborator) commented May 5, 2022

Now that MultiRegion is finally merged, we can implement the single-node multi-GPU paradigm in the NonhydrostaticModel as well.

cc @tomchor

The work can be divided into three tasks:

  • Adapt the NonhydrostaticModel to accept a MultiRegionGrid, i.e., wrap local function calls in @apply_regionally and extend global methods in multi_region_models.jl.
  • Expose the parallelism in the RungeKutta3 time stepper and in the update_state! method. This is achieved by lumping local function calls (all possible kernel calls, such as tendency calculations, RK3 substeps, etc.) together in outer functions and wrapping everything in @apply_regionally (a sketch of this pattern is given below the list).
  • Implement a multi-GPU pressure solver. This can be achieved in a couple of different ways: (1) transpose local memory and perform the FFT one direction at a time (as we do now in the Distributed module through PencilArrays); (2) exploit the multi-GPU capabilities of CUDA through the cufftxt library, which can perform single-node distributed FFTs on up to 16 GPUs; (3) allocate the storage and plan in unified memory and perform the FFT on only one GPU. We would implement (3) only if we are desperate. The best solution would be to go with method (2), as (1) incurs hefty memory-transfer costs (I am not sure how cufftxt implements the multi-GPU FFT, though).

The first two tasks are fairly straightforward, so I think the bulk of the work will be implementing the pressure solver.
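For illustration, here is a toy, self-contained sketch of the "lump local calls into one outer function" pattern from the second task. Everything here (the Region struct, compute_tendencies!, substep!, local_calculations!) is a hypothetical stand-in, not Oceananigans API; only @apply_regionally is the real macro referred to above, mimicked here by a plain loop over regions:

# Toy stand-in: each "region" holds its own fields and tendencies.
struct Region
    u  :: Vector{Float64}
    Gu :: Vector{Float64}
end

compute_tendencies!(r::Region) = (r.Gu .= -r.u; nothing)       # stand-in for the tendency kernels
substep!(r::Region, Δt)        = (r.u .+= Δt .* r.Gu; nothing) # stand-in for an RK3 substep

# Lump the local kernel calls into one outer function, so a single
# region-wise dispatch covers all of them:
function local_calculations!(r::Region, Δt)
    compute_tendencies!(r)
    substep!(r, Δt)
    return nothing
end

regions = [Region(rand(4), zeros(4)) for _ in 1:2]
# In Oceananigans this loop would be written `@apply_regionally local_calculations!(model, Δt)`:
foreach(r -> local_calculations!(r, 0.1), regions)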

@simone-silvestri added the labels "experimental feature 🧪 (Because danger is the spice of life)" and "distributed 🕸️ (Our plan for total cluster domination)" on May 5, 2022
@simone-silvestri marked this pull request as draft on May 5, 2022 14:54
@glwagner (Member) commented May 5, 2022

Expose the parallelism in the RungeKutta3 time stepper and in the update_state! method. This is achieved by lumping local function calls (all possible kernel calls, such as tendency calculations, RK3 substeps, etc.) together in outer functions and wrapping everything in @apply_regionally

This is not strictly necessary, right? Just if we also want to support RungeKutta3.

@simone-silvestri (Collaborator, Author)

You're right, but it is quite easy to do.

@glwagner (Member) commented May 5, 2022

Implement a multi-GPU pressure solver. This can be achieved in a couple of different ways: (1) transpose local memory and perform the FFT one direction at a time (as we do now in the Distributed module through PencilArrays); (2) exploit the multi-GPU capabilities of CUDA through the cufftxt library, which can perform single-node distributed FFTs on up to 16 GPUs; (3) allocate the storage and plan in unified memory and perform the FFT on only one GPU. We would implement (3) only if we are desperate. The best solution would be to go with method (2), as (1) incurs hefty memory-transfer costs (I am not sure how cufftxt implements the multi-GPU FFT, though).

I think (but am not 100% sure) that PencilArrays is tied to MPI. So I guess for (1) we are either "borrowing" (but not directly using) the transpose algorithm from PencilArrays, or we are extending the code so that it works "without MPI" -- and also with GPUs, which is work in progress: jipolanco/PencilFFTs.jl#37

Another reason to focus on (2) is that we can use PencilFFTs for the distributed (but not MultiRegion) nonhydrostatic model. So even if PencilFFTs supported CuArray now (and even if Distributed were performant for multi-GPU --- both of which may not be too close), using cufftxt could still be motivated by performance.

In other words, if we develop a capability with cufftxt, then in the future we can also support multi-GPU via PencilFFTs and Distributed, and we will have two options (one perhaps performant on a single node with multiple GPUs, and another for when you need GPUs spread across multiple nodes).

@glwagner (Member) commented May 5, 2022

Here's some description of cufftxt:

https://docs.nvidia.com/cuda/cufft/index.html#multiple-GPU-cufft-transforms

return nothing
end

function local_calculations!(model)
A Member commented on this line:

Suggested change
function local_calculations!(model)
function local_nonhydrostatic_state_update!(model)

Even though this function appears right next to where it's used, it's possible that we encounter it in external contexts (e.g., in an error message). So it's helpful to give it a descriptive name (since it's not much extra effort to do that, we might as well...).

@simone-silvestri (Collaborator, Author) commented May 6, 2022

A new idea for extending the pressure solver to MultiRegion is to divide the FFT computations into "local" and "non-local" directions.

The FFT in local directions can be easily performed by wrapping the transform in @apply_regionally.
For non-local directions, if the storage and plan are unified arrays, the non-local FFT can be performed by permuting the partitioning of the MultiRegionGrid without explicitly transposing memory (that happens under the hood thanks to unified memory).

This strategy would not easily extend to arbitrarily subdivided regions and will only play well with one-direction partitioning, but given the current limitations of the FFT solver (regular grids only), I think it is a good way to get something working.
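As a serial sanity check of this idea (plain FFTW on the CPU, so no MultiRegion, unified memory, or GPUs involved): transforming the "local" directions first and the remaining direction afterwards reproduces the full 3D transform.

using FFTW

A = rand(ComplexF64, 4, 6, 8)   # (x, y, z)

Â = fft(A, (2, 3))   # 1. FFT along the directions local to each region (y and z here)
                     # 2. permute the partition direction; with unified memory the
                     #    data movement between devices would happen under the hood
Â = fft(Â, 1)        # 3. FFT along the remaining (previously non-local) x direction

Â ≈ fft(A)           # true: direction-by-direction transforms compose to the full 3D FFT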

@glwagner (Member) commented May 6, 2022

For non-local directions, if the storage and plan are unified arrays, the non-local FFT can be performed by permuting the partitioning of the MultiRegionGrid without explicitly transposing memory (that happens under the hood thanks to unified memory).

Only storage is an array; plan is a CUFFT object, not an array.

If we use cufftxt, would this happen by default?

I.e., with cufftxt we have to build unified storage (and maybe unified eigenvalues). Then, provided we can fill storage correctly and empty it correctly at the end, all that's left is to "just do" the FFT (transposes etc. handled under the hood).

Does broadcasting work with unified memory arrays?

@simone-silvestri (Collaborator, Author) commented May 6, 2022

Yep, broadcasting works. My thought was that plan can be hacked to store unified memory; I still have to look at the data structure to see how to do it.

cufftxt basically works the same way (the local FFT direction is distributed among the workers, then a transpose, then the non-local FFT). I am not sure whether they use unified memory, but they certainly use transposes:
https://on-demand.gputechconf.com/gtc/2014/presentations/S4788-rapid-multi-gpu-programming-cuda-libraries.pdf
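A quick check of the broadcasting point, under the assumption that a recent CUDA.jl accepts a `unified = true` keyword in `cu` (the exact API for allocating unified memory has changed across CUDA.jl versions, so treat this as a sketch):

using CUDA

a = cu(ones(Float32, 16); unified = true)   # unified-memory CuArray
b = cu(fill(2f0, 16);     unified = true)

c = a .+ b           # broadcasting works on unified-memory arrays
all(Array(c) .== 3)  # true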

@glwagner (Member) commented May 6, 2022

Yep, broadcasting works. My thought was that plan can be hacked to store unified memory; I still have to look at the data structure to see how to do it.

cufftxt basically works the same way (the local FFT direction is distributed among the workers, then a transpose, then the non-local FFT). I am not sure whether they use unified memory, but they certainly use transposes: https://on-demand.gputechconf.com/gtc/2014/presentations/S4788-rapid-multi-gpu-programming-cuda-libraries.pdf

Mmm, ok. Is this proposal basically a way to avoid cufftxt? I think what you outlined is roughly how PencilFFTs works:

  1. FFT along local direction (dim=1)
  2. Simultaneously communicate and permute data to (2, 1, 3) (or is it (2, 3, 1)?)
  3. FFT along local direction (dim=2)
  4. Simultaneously communicate and permute data to (3, 2, 1)
  5. FFT along dim=3.

At the end, the data has permutation (3, 2, 1). The backwards transform then reverses this process. solver.storage is actually a tuple of 3 preallocated arrays to support this algorithm.
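A serial stand-in for this sequence, using plain FFTW: each permutedims below represents a communicate-and-permute step, and the particular permutation tuples are just one consistent choice, not necessarily the ones PencilFFTs uses internally.

using FFTW

A = rand(ComplexF64, 4, 6, 8)        # (x, y, z)

Â = fft(A, 1)                        # 1. FFT along the local direction (x)
Â = permutedims(Â, (2, 1, 3))        # 2. permute so y becomes local: (y, x, z)
Â = fft(Â, 1)                        # 3. FFT along y
Â = permutedims(Â, (3, 1, 2))        # 4. permute so z becomes local: (z, y, x)
Â = fft(Â, 1)                        # 5. FFT along z

Â ≈ permutedims(fft(A), (3, 2, 1))   # true: the full 3D FFT, stored with permutation (z, y, x)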

For the tridiagonal solver I think we want to use the same algorithm, except that we skip step 1 (i.e., the first step is to communicate and permute data with no transform). Once the two other transforms are complete we have data in the configuration (x, y, z) where z is local, and we can do a tridiagonal solve in eigenfunction space. Then we transform back, obtain data back in the (z, y, x) permutation with z local, and copy it into the pressure.

We have to extend the tridiagonal solver to accommodate this kind of permutation for the distributed CPU case, so if we have an algorithm like the one above we can then also use it for MultiRegionGrid solves on the GPU.

@simone-silvestri (Collaborator, Author)

Exactly. In practice a transpose is communication + permutation, but with unified memory we avoid the explicit communication part (I mean, it is still there, but it's handled by CUDA).

@glwagner (Member) commented May 6, 2022

Exactly. In practice a transpose is communication + permutation, but with unified memory we avoid the explicit communication part (I mean, it is still there, but it's handled by CUDA).

So maybe we need three unified memory arrays for solver.storage, each permuted with respect to one another?

I was thinking it'd be nice to avoid coding it ourselves by using cufftxt, but now that we're talking about it, it doesn't seem too difficult.

@navidcy (Collaborator) commented Jul 1, 2023

Is this PR superseded by #2795?

@tomchor (Collaborator) commented Jul 1, 2023

Is this PR superseded by #2795?

I think so. Although somehow this one seems to have fewer conflicts...

@glwagner (Member) commented Jul 5, 2023

Also in my opinion we should implement a distributed GPU nonhydrostatic model first. MultiRegion is a cherry on top for sure, but not essential.

@simone-silvestri (Collaborator, Author)

Yeah, #2795 does supersede this PR, although #2795 has a custom implementation of transpose + FFT, which we might want to avoid now that PencilFFTs supports CuArrays.

@simone-silvestri (Collaborator, Author)

Stale and superseded by #2795; I'll close.
