STRUMPACK with > 1 GPU per node #108
Comments
You should run 1 MPI process per GPU (it sounds like that is what you are trying to do). But if an MPI process sees multiple GPU devices, in STRUMPACK we select one of them with cudaSetDevice, based on the MPI rank. What exactly do you mean with "each process is assigned in my own code to devices 0 and 1"?
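For illustration, here is a minimal sketch of that kind of rank-based device selection; the `rank % device_count` policy and the surrounding structure are assumptions for the example, not a copy of the STRUMPACK source:

```cpp
// Sketch: each MPI rank picks one of the GPUs it can see, based on its rank.
// Illustrative only; not the actual STRUMPACK implementation.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int device_count = 0;
  cudaGetDeviceCount(&device_count);
  int device = rank % device_count;  // round-robin over visible devices
  cudaSetDevice(device);

  std::printf("rank %d: %d visible device(s), using device %d\n",
              rank, device_count, device);

  MPI_Finalize();
  return 0;
}
```

On multi-node runs one would typically map devices from the node-local rank (for example obtained via `MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED`) rather than the global rank.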
I mean I am calling cudaSetDevice myself in my own code, assigning one device to each MPI process. I'm not sure why this isn't working, though. It seems like the strategy from inside of STRUMPACK of calling cudaSetDevice based on the rank should work as well.
Could it be due to MAGMA? Can you try without MAGMA? I think the code at that line is not executed when running without MAGMA.
Hm, odd. Building without MAGMA I still get an error, and it occurs outside of the STRUMPACK solve. Again, everything is totally fine when using SuperLU_DIST (which is using the GPU) or a CPU-based direct solve. It seems like maybe STRUMPACK is corrupting memory somewhere?
I'll see if I can find a machine with multiple GPUs per node.
Awesome, thanks! For reference, I'm running on AWS, on a single p3.8xlarge EC2 instance (4 x V100 GPUs).
I can reproduce it on Perlmutter, using setup (1): https://docs.nersc.gov/systems/perlmutter/running-jobs/#1-node-4-tasks-4-gpus-all-gpus-visible-to-all-tasks

The only difference between setups (1) and (2) is that for (1), where all GPUs are visible to all tasks, STRUMPACK calls cudaSetDevice itself to select a device. I did also notice that it runs fine with setup (1) when I add … I will investigate further.

You say it works with SuperLU. Did you set …?
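As a generic way to see whether a library call changes the device the application selected (which is the key difference between setups (1) and (2) above), one can compare `cudaGetDevice` before and after the call. A sketch; `run_and_check_device` is a hypothetical helper, not part of STRUMPACK:

```cpp
// Generic diagnostic: report if the current CUDA device changed across a
// library call that may itself call cudaSetDevice internally.
#include <cuda_runtime.h>
#include <cstdio>

template <typename F>
void run_and_check_device(int rank, const char* label, F&& call) {
  int before = -1, after = -1;
  cudaGetDevice(&before);
  call();                       // e.g. wrap the factorization or solve here
  cudaGetDevice(&after);
  if (before != after)
    std::printf("rank %d: current device changed from %d to %d during %s\n",
                rank, before, after, label);
}
```

Wrapping the factorization and solve calls this way on each rank would show whether, under setup (1), the solver moves a rank off the device the application assigned to it.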
Awesome, thank you for your help with this, and great that you can reproduce it. No, I did not set that.
It works correctly with SuperLU with … I can't figure out what is wrong. I know that calling … Perhaps you can set …
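One common workaround in this situation (not necessarily what was being suggested above) is to restrict each rank to a single GPU via `CUDA_VISIBLE_DEVICES` before the first CUDA runtime call, which effectively reproduces setup (2). A sketch, assuming one rank per GPU and that MPI_Init does not itself initialize CUDA (the case for non-CUDA-aware MPI):

```cpp
// Sketch: pin each MPI rank to one GPU by restricting visibility before
// any CUDA runtime call. Assumes one rank per GPU on the node.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Node-local rank: ranks sharing a node are numbered 0, 1, 2, ...
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int local_rank;
  MPI_Comm_rank(node_comm, &local_rank);

  // Make only one device visible to this process; this must happen before
  // the CUDA runtime is initialized in this process.
  char dev[8];
  std::snprintf(dev, sizeof(dev), "%d", local_rank);
  setenv("CUDA_VISIBLE_DEVICES", dev, 1);

  // ... create and run the GPU-enabled solver after this point ...

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```

On Slurm systems the same effect can usually be obtained from the job launcher (per-task GPU binding) instead of from the application code.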
I wonder if the issue is somehow an interplay between STRUMPACK and SLATE (I noticed SLATE also has calls to cudaSetDevice).
I also see the issue without linking with SLATE (or MAGMA).
Original issue:
I'm having some issues running STRUMPACK with more than a single GPU per node. By comparison, SuperLU_DIST is fine. This is with CUDA. In particular, if I run with 2 MPI processes (MPI is not CUDA-aware), where each process is assigned in my own code to devices 0 and 1 respectively, I get errors.
Is there anything special we need to do when building STRUMPACK + MPI with CUDA support?
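For concreteness, a minimal sketch of the kind of per-rank device assignment described in the report; the structure and the placeholder solver call are assumptions, not the reporter's actual code:

```cpp
// Hypothetical reconstruction of the reported setup: 2 MPI ranks on one
// node, each bound to its own GPU before the distributed solve.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Rank 0 -> device 0, rank 1 -> device 1, as described in the report.
  cudaSetDevice(rank);

  // ... assemble the distributed matrix and call the GPU-enabled direct
  //     solver (STRUMPACK or SuperLU_DIST) here ...

  MPI_Finalize();
  return 0;
}
```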