[WIP] Extend MultiRegion to NonhydrostaticModel #2523
Conversation
This is not strictly necessary, right? Just if we want to also support RungeKutta3.
You're right. But it is quite easy to do so.
I think (but am not 100% sure) that PencilArrays is tied to MPI. So I guess for 1 we are either "borrowing" (but not using directly) the transpose algorithm from PencilArrays, or we are extending the code so that it works "without MPI" -- and also with GPUs, which is work in progress: jipolanco/PencilFFTs.jl#37

Another reason to focus on 2 is that we can use PencilFFTs for distributed (but not …). In other words, if we develop a capability with cufftXt, then in the future we can also support multi-GPU via PencilFFTs and …
Here's some description of cufftXt: https://docs.nvidia.com/cuda/cufft/index.html#multiple-GPU-cufft-transforms
```julia
    return nothing
end

function local_calculations!(model)
```
Suggested change:

```diff
- function local_calculations!(model)
+ function local_nonhydrostatic_state_update!(model)
```
Even though this function appears right next to where it's used, it's possible that we encounter it in external contexts (i.e. in an error). So it's helpful if we give it a descriptive name (since it's not much extra effort to do that, we might as well...)
A new idea for extending the pressure solver to … The FFT in the local directions can be easily performed by wrapping the transform in … This strategy would not be easily extensible to generally subdivided regions and will play well only with one-direction partitioning, but given the current limitations of the FFT solver (only regular grids), I think it is a good strategy to get something working.
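As a rough illustration of the one-direction partitioning idea, here is a hedged single-process NumPy sketch (not Oceananigans code; the slab split and per-region FFT calls stand in for a `MultiRegionGrid` decomposition and `@apply_regionally`-wrapped kernels):

```python
import numpy as np

def slab_fft(field, nregions=2):
    # One-direction partition: split the grid into slabs along x,
    # mimicking a MultiRegion decomposition in a single direction.
    slabs = np.split(field, nregions, axis=0)
    # Each region transforms its local directions (y and z) independently;
    # in the model these per-region calls would be wrapped in @apply_regionally.
    slabs = [np.fft.fftn(s, axes=(1, 2)) for s in slabs]
    # The partitioned direction (x) is non-local: gather the slabs
    # (the communication step) and transform it last.
    return np.fft.fft(np.concatenate(slabs, axis=0), axis=0)
```

Because FFTs along different axes commute, this matches a full `np.fft.fftn(field)` exactly; the only genuinely hard part in practice is the gather/transpose step.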
Just … If we use …, i.e. with … Does broadcasting work with unified memory arrays?
Yep, broadcasting works. My thought was that cufftXt basically works in the same way (local FFT directions distributed among the workers, then a transpose and the non-local FFT). I am not sure they use unified memory, but they for sure use transposes.
Mmm, ok. Is this proposal a way to avoid cufftXt, basically? I think what you outlined is roughly how PencilFFTs works:
At the end, the data has permutation (3, 2, 1). The backwards transform then reverses this process. For the tridiagonal solver I think we want to use the same algorithm, except that we skip step 1 (i.e. the first step is to communicate and permute data with no transform). Once the two other transforms are complete we have data in the configuration (x, y, z) where z is local, and we can do a tridiagonal solve in eigenfunction space. Then we transform back and obtain data back in the (z, y, x) permutation, with z local, and copy into the pressure. We have to extend the tridiagonal solver to accommodate this kind of permutation for distributed CPU, so if we have an algorithm like the one above we can then also use it for MultiRegionGrid solves on the GPU.
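The transform/transpose sequence above can be sketched in single-process NumPy (hedged: in PencilFFTs the transposes also involve inter-worker communication; here they are plain in-memory permutations):

```python
import numpy as np

def forward(field):
    # Step 1: transform the locally contiguous direction (x here).
    u = np.fft.fft(field, axis=0)                         # layout (x, y, z)
    # Step 2: "transpose" = communication + permutation, then transform
    # the newly local direction (originally y).
    u = np.fft.fft(np.transpose(u, (1, 0, 2)), axis=0)    # layout (y, x, z)
    # Step 3: transpose again and transform the last direction (originally z).
    u = np.fft.fft(np.transpose(u, (2, 0, 1)), axis=0)    # layout (z, y, x)
    # The result carries the (3, 2, 1) permutation mentioned above.
    return u
```

`forward(field)` equals `np.transpose(np.fft.fftn(field), (2, 1, 0))`, and the backwards transform just reverses the sequence; the tridiagonal variant would skip the first transform and keep only its communicate-and-permute step.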
Exactly. In practice a transpose is communication + permutation, but with …
So maybe we need three unified memory arrays for … I was thinking it'd be nice to avoid coding it ourselves by using cufftXt, but now that we're talking about it, it doesn't seem too difficult.
Is this PR superseded by #2795? |
I think so. Although somehow this one seems to have fewer conflicts... |
Also in my opinion we should implement a distributed GPU nonhydrostatic model first. MultiRegion is a cherry on top for sure, but not essential. |
Stale and superseded by #2795; I'll close.
Now that `MultiRegion` is finally merged, we can implement the single-node multi-GPU paradigm also in the `NonhydrostaticModel`. cc @tomchor

The work can be divided into three tasks:

1. … a `MultiRegionGrid`, i.e., wrap local function calls in `@apply_regionally` and extend global methods in `multi_region_models.jl`.
2. … the `RungeKutta3` timestepper and in the `update_state!` method. This is achieved by lumping together local function calls (all possible kernel calls such as calculating tendencies, RK substeps, etc.) in outer functions and wrapping everything in `@apply_regionally`.
3. Implement the pressure solver: (1) … the `Distributed` module through PencilArrays; (2) exploit the multi-GPU capabilities of CUDA through the cufftXt library, which can perform single-node distributed FFTs on up to 16 GPUs; (3) allocate storage and plan in unified memory and perform the FFT on only one GPU. Ideally we would implement (3) only if we are desperate. The best solution would be to go with method (2), as (1) incurs hefty memory-transfer costs (I am not sure how cufftXt implements its multi-GPU FFT, though).

The first two tasks are quite trivial, so I think the bulk of the work will be on implementing the pressure solver.
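To make the pressure-solver target concrete, here is a hedged NumPy sketch of what the FFT solve computes on a regular, triply periodic grid (the function name and grid setup are illustrative, not Oceananigans API): solve ∇²p = r by dividing by the eigenvalues of the discrete Laplacian in transform space.

```python
import numpy as np

def solve_poisson(rhs, L=2 * np.pi):
    # Eigenvalues of the second-order finite-difference Laplacian,
    # one set of wavenumbers per direction.
    lam = []
    for n in rhs.shape:
        dx = L / n
        lam.append((2 * np.cos(2 * np.pi * np.fft.fftfreq(n)) - 2) / dx**2)
    lam3 = lam[0][:, None, None] + lam[1][None, :, None] + lam[2][None, None, :]
    rhs_hat = np.fft.fftn(rhs)
    lam3[0, 0, 0] = 1.0            # avoid 0/0 for the mean (k = 0) mode
    p_hat = rhs_hat / lam3
    p_hat[0, 0, 0] = 0.0           # pin the arbitrary mean pressure to zero
    return np.real(np.fft.ifftn(p_hat))
```

In the distributed/multi-region setting only the `fftn`/`ifftn` calls change (per-region FFTs plus transposes as discussed in the conversation); the eigenvalue division stays embarrassingly parallel.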