Document HPC tight integration #3397
Testing
Terminology
SLURM support

Design

I envision something like:

Running interactively

PyMAPDL is on the login node. There we start a Python console and we write:

from ansys.mapdl.core import launch_mapdl

slurm_options = {
"partition": "qsmall",
"nodes": 6,
"ntasks-per-node": 8
}
mapdl = launch_mapdl(slurm_options=slurm_options)
# if no `slurm_options` are used, we should specify something like:
# mapdl = launch_mapdl(ON_SLURM=True)
# so we can do the switch, because there is no way to know whether this is a normal machine or a login node of an HPC cluster (check for the sbatch process??)
mapdl.prep7()
...
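
Regarding the detection question in the comment above, here is a minimal sketch of how that switch could be guessed automatically; the `running_on_hpc` helper is hypothetical, not an existing PyMAPDL function:

```python
import os
import shutil


def running_on_hpc() -> bool:
    """Guess whether we are on a SLURM login or compute node (sketch)."""
    # Finding the `sbatch` binary on PATH is a strong hint of a SLURM cluster.
    if shutil.which("sbatch") is not None:
        return True
    # Inside a SLURM job, SLURM_* environment variables are always present.
    return any(name.startswith("SLURM_") for name in os.environ)
```

Something like this could decide whether to go through `sbatch` or do a plain local launch.
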
Under the hood, `launch_mapdl` will use an `sbatch` call like:

# --partition=qsmall --job-name=sbatch_ansys --ntasks=16 --nodes=8 --ntasks-per-node=2 --export=all
sbatch %slurm_options% --wrap "/ansys_inc/v242/ansys/bin/ansys242 -grpc -dis -np 8 -b -i /home/ubuntu/input.inp"

Proper handling of options should be done, but each MAPDL instance will be an independent SLURM job, so it can utilize all the resources allocated for that job.

One complication could be how to obtain the IP of the remote instance. We can use:

scontrol show job $JOBID | grep BatchHost | cut -d'=' -f2
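
A sketch of how that could look from Python once the job ID is known; the `get_batch_host_ip` helper is made up for illustration:

```python
import socket
import subprocess


def get_batch_host_ip(job_id: str) -> str:
    """Return the IP of the node hosting the MAPDL job (sketch)."""
    out = subprocess.run(
        ["scontrol", "show", "job", job_id],
        capture_output=True, text=True, check=True,
    ).stdout
    # scontrol prints space-separated "Key=Value" fields; pick BatchHost.
    hostname = next(
        field.split("=", 1)[1]
        for field in out.split()
        if field.startswith("BatchHost=")
    )
    return socket.gethostbyname(hostname)
```
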
MapdlPool case

For the MAPDL pool case, there should be no problem; it should be the same as with `launch_mapdl`:

from ansys.mapdl.core import MapdlPool

slurm_options = {
"partition": "qsmall",
"nodes": 6,
"ntasks-per-node": 8
}
pool = MapdlPool(8, slurm_options=slurm_options)
# Spawn 8 MAPDL instances with 48 cores each (6 nodes each and 8 tasks per node)
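
As a sketch of that mechanism (one independent SLURM job per pool instance), the pool could submit the jobs itself and keep the job IDs to later resolve each instance's host; the options and MAPDL command below are illustrative only:

```python
import subprocess

# Illustrative command; matches the example above (48 cores per instance).
MAPDL_CMD = "/ansys_inc/v242/ansys/bin/ansys242 -grpc -dis -np 48 -b"

job_ids = []
for i in range(8):
    # --parsable makes sbatch print only the job ID (plus cluster name, if any).
    out = subprocess.run(
        [
            "sbatch", "--parsable",
            "--partition=qsmall", "--nodes=6", "--ntasks-per-node=8",
            f"--job-name=mapdl_pool_{i}",
            "--wrap", MAPDL_CMD,
        ],
        capture_output=True, text=True, check=True,
    ).stdout
    job_ids.append(out.strip().split(";")[0])
# Each job ID can then be passed to `scontrol show job` to find its BatchHost/IP.
```
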
I think making the divisions ourselves will just complicate stuff.

Running batch mode

On the login node we will do something like:

sbatch my_script.sh

where my_script.sh contains something like:

# Define SLURM options here
source /path/to/my/venv/bin/activate
python my_python_script

In this case, under the hood we will do a normal launch. This case seems simple. The code is managed as non-HPC local; only the env vars need to be in place.
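
A sketch of what that could look like inside `my_python_script`, assuming we only read the SLURM allocation from the environment and pass the core count to `launch_mapdl` through its `nproc` argument:

```python
import os

from ansys.mapdl.core import launch_mapdl

# Inside the sbatch job, SLURM describes the allocation through env vars.
ntasks = int(os.environ.get("SLURM_NTASKS", "1"))

# Plain local launch; SLURM already placed us on the allocated resources.
mapdl = launch_mapdl(nproc=ntasks)
mapdl.prep7()
```
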
MapdlPool case

For this case, things get a bit trickier. We have two options:

The first case is the simplest to implement, but I can see the performance is not as good as the second. I guess from the user perspective, we should keep things simple. I can imagine a (bool) option like `share_nodes`:

from ansys.mapdl.core import MapdlPool

slurm_options = {
"partition": "qsmall",
"nodes": 16,
"ntasks-per-node": 1
}
# All MAPDL instances spawn across all nodes.
# using MAPDL -nproc 16
pool = MapdlPool(8, slurm_options=slurm_options, share_nodes=True)

# Each MAPDL instance spawns on its own subset of nodes,
# using MAPDL -nproc 2 and:
# On first MAPDL instance: srun --nodes=node0-1
# On second MAPDL instance: srun --nodes=node2-3
# and so on...
pool = MapdlPool(8, slurm_options=slurm_options, share_nodes=False)

But what should happen when the resources are not evenly split? I guess we should fall back to mode 1 (`share_nodes=True`). Feedback is appreciated here too.
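
To make the even/uneven question concrete, here is a small sketch of how the allocated nodes could be divided among the instances, falling back to sharing when the division is not exact (`split_nodes` is just an illustration):

```python
def split_nodes(nodes, n_instances):
    """Split the node list evenly, or return None to signal 'share the nodes'."""
    if len(nodes) % n_instances != 0:
        return None  # uneven split -> fall back to share_nodes=True
    per_instance = len(nodes) // n_instances
    return [
        nodes[i * per_instance:(i + 1) * per_instance]
        for i in range(n_instances)
    ]


nodes = [f"node{i}" for i in range(16)]
print(split_nodes(nodes, 8))  # 8 groups of 2 nodes: node0-1, node2-3, ...
print(split_nodes(nodes, 5))  # None: 16 nodes cannot be split evenly among 5 instances
```
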
Summary

I think this should be converted to a document section.

Comment moved to a new issue: #3467

Disclaimer: this is a very quick and dirty draft. It is not final info. Do not trust this information blindly.

To run MAPDL across multiple nodes in an HPC cluster, there are two approaches.

In both cases, the env var `ANS_MULTIPLE_NODES` is needed. It does several things; for instance, it detects the InfiniBand network interface.

Default

You have to specify the names of the compute nodes (machines) and the number of CPUs used (these can be obtained from the SLURM env vars).

Example script
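
A minimal sketch of such a script, assuming the hostnames come from `SLURM_JOB_NODELIST` (expanded with `scontrol show hostnames`) and are passed to MAPDL through a `-machines host1:n:host2:n` switch via `launch_mapdl`'s `additional_switches` argument; verify the exact switch format against the MAPDL documentation:

```python
import os
import subprocess

from ansys.mapdl.core import launch_mapdl

# Required for multi-node runs (see above).
os.environ["ANS_MULTIPLE_NODES"] = "1"

# Expand the compressed node list (e.g. "node[01-04]") into one hostname per line.
hosts = subprocess.run(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
    capture_output=True, text=True, check=True,
).stdout.split()

tasks_per_node = int(os.environ.get("SLURM_NTASKS_PER_NODE", "1"))

# Assumed "-machines host1:n:host2:n" format for distributed MAPDL.
machines = ":".join(f"{host}:{tasks_per_node}" for host in hosts)

mapdl = launch_mapdl(additional_switches=f"-machines {machines}")
```
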
Tight integration

You can rely on MPI tight integration with the scheduler by setting the `HYDRA_BOOTSTRAP` env var equal to the scheduler.
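
A sketch of the same launch relying on tight integration instead, assuming `slurm` is the bootstrap value understood by the MPI's Hydra launcher (verify for the MPI shipped with your MAPDL version):

```python
import os

from ansys.mapdl.core import launch_mapdl

os.environ["ANS_MULTIPLE_NODES"] = "1"
# Let the MPI Hydra launcher query the node list directly from the scheduler.
os.environ["HYDRA_BOOTSTRAP"] = "slurm"

mapdl = launch_mapdl()
```
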
Notes