MPI error - Communication failed between nodes #53
Hello @SurelyD, it seems that you have at least two overlapping issues in your case. There is something wrong with your MPI setup, but without additional details I cannot tell exactly what. A quick search on the internet shows that MPI errors similar to yours are frequent with some OpenMPI versions and Mellanox cards, and that they can be solved by configuring MPI appropriately (see e.g. this issue). You may also want to try another MPI implementation, such as MVAPICH. If possible, I would also recommend testing your MPI setup independently from GPUSPH (I believe the CUDA samples include some MPI examples these days).

Aside from the MPI issue, you also seem to have one related specifically to your test case. This message:

FATAL: cannot handle 1436584140 > 1073741823 cells
indicates that you are running an extremely large simulation, with a domain that spans over a billion cells (cells are used for particle sorting, fast neighbor search, domain splitting in multi-device runs, and to preserve uniform accuracy throughout the domain). Since GPUSPH stores the cell index as a 32-bit unsigned integer whose two highest bits are reserved for multi-GPU/multi-node usage, there is a limit of 2^30 (around 10^9) cells in the domain. There are a few tricks that can be used to stay within this limit, depending on your use-case (e.g. rotating the domain and the gravity vector if you have a long slope). If you can share more details we can look for a solution.
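As a rough illustration of where that limit bites, here is a minimal back-of-the-envelope check; the 100 m x 50 m x 30 m extent and the 0.05 m cell side below are made-up placeholder values, not taken from your setup (in practice the cell side follows from your smoothing length / influence radius):

# Hypothetical domain of 100 m x 50 m x 30 m with an assumed 0.05 m cell side
nx=2000; ny=1000; nz=600          # extent along each axis divided by the cell side
cells=$((nx * ny * nz))           # cells in the uniform background grid
limit=$((2**30 - 1))              # 1073741823: 32-bit index with 2 bits reserved
echo "cells = $cells, limit = $limit"
if [ "$cells" -gt "$limit" ]; then
    echo "over the limit: coarsen the resolution, shrink the domain, or rotate it to tighten its bounding box"
fi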
Please also note that, although GPUSPH supports it, until some time ago multi-node multi-GPU was not supported by several MPI libraries (at least MVAPICH), which were unable to handle multiple device contexts. Things might have changed, but it might be worth trying 2 processes on each node (4 GPUSPH processes on 2 nodes, each using a different device) to rule out one of the issues. EDIT: now that I look more closely at the command you used, I'm afraid you are running 2 processes on each node, with each process attempting to use both GPUs. You should pass a single value to --device.
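For completeness, here is a minimal sketch of how the MPI side could be probed, assuming OpenMPI is the implementation in use; the --mca settings are generic OpenMPI options (not GPUSPH-specific), the UCX variant only applies if your OpenMPI build was compiled with UCX support, and the single --device value is just a placeholder for whichever GPU you assign to each process:

# 1. Sanity-check the MPI transport with no GPUSPH (and no CUDA) involved:
mpirun -np 4 -npernode 2 hostname
# 2. Exclude the openib BTL (the component reporting the QP-to-RTR error); this falls
#    back to TCP, which is slow but enough to tell whether openib is the problem:
mpirun --mca btl ^openib -np 4 -npernode 2 ./GPUSPH --device 0
# 3. If OpenMPI was built against UCX (the usual recommendation on Mellanox/InfiniBand
#    fabrics), selecting the UCX PML is generally the proper fix rather than the TCP fallback:
mpirun --mca pml ucx -np 4 -npernode 2 ./GPUSPH --device 0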
Hi,
I tried to run GPUSPH version 5 on the cluster using two nodes with two GPUs on each node, with the following command:
mpirun -np 4 -npernode 2 ./GPUSPH --device 0,1 # each node of Mahuika has two GPUs
However, the simulation only ran on one node and I got the following error messages:
[vgpuwbg001][[30539,1],0][connect/btl_openib_connect_udcm.c:1236:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument
[vgpuwbg001][[30539,1],1][connect/btl_openib_connect_udcm.c:1236:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument
[vgpuwbg002][[30539,1],3][connect/btl_openib_connect_udcm.c:1236:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument
[vgpuwbg002][[30539,1],2][connect/btl_openib_connect_udcm.c:1236:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument
FATAL: cannot handle 1436584140 > 1073741823 cells
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.
Does anyone know how to fix it?
Thanks.