feat: support for launching an MAPDL instance in an SLURM HPC cluster (#3497)

* feat: adding env vars needed for multinode

* feat: adding env vars needed for multinode

* feat: renaming hpc detection argument

* docs: adding documentation

* chore: adding changelog file 3466.documentation.md

* feat: adding env vars needed for multinode

* feat: renaming hpc detection argument

* docs: adding documentation

* chore: adding changelog file 3466.documentation.md

* fix: vale issues

* chore: To fix sphinx build

Squashed commit of the following:

commit c1d1a3e
Author: German <[email protected]>
Date:   Mon Oct 7 15:33:19 2024 +0200

    ci: retrigger CICD

commit b7b5c30
Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Date:   Mon Oct 7 13:31:55 2024 +0000

    ci: auto fixes from pre-commit.com hooks.

    for more information, see https://pre-commit.ci

commit 32a1c02
Author: Revathy Venugopal <[email protected]>
Date:   Mon Oct 7 15:31:24 2024 +0200

    fix: add suggestions

    Co-authored-by: German <[email protected]>

commit 575a219
Merge: f2afe13 be1be2e
Author: Revathyvenugopal162 <[email protected]>
Date:   Mon Oct 7 15:09:01 2024 +0200

    Merge branch 'fix/add-build-cheatsheet-as-env-varaible' of https://github.com/ansys/pymapdl into fix/add-build-cheatsheet-as-env-varaible

commit f2afe13
Author: Revathyvenugopal162 <[email protected]>
Date:   Mon Oct 7 15:08:58 2024 +0200

    fix: precommit

commit be1be2e
Author: pyansys-ci-bot <[email protected]>
Date:   Mon Oct 7 13:07:35 2024 +0000

    chore: adding changelog file 3468.fixed.md

commit f052a4d
Author: Revathyvenugopal162 <[email protected]>
Date:   Mon Oct 7 15:05:56 2024 +0200

    fix: add build cheatsheet as env variable within doc-build

* docs: expanding a bit troubleshooting advices and small format fix

* docs: fix vale

* fix: nproc tests

* feat: adding env vars needed for multinode

* feat: renaming hpc detection argument

* docs: adding documentation

* chore: adding changelog file 3466.documentation.md

* fix: vale issues

* docs: fix vale

* docs: expanding a bit troubleshooting advices and small format fix

* fix: nproc tests

* revert: "chore: To fix sphinx build"

This reverts commit e45d2e5.

* docs: clarifying where everything is running.

* docs: expanding bash example

* tests: fix

* docs: adding `PYMAPDL_NPROC` to env var section

* feat: adding 'pymapdl_proc' to non-slurm run. Adding tests too.

* docs: fix vale issue

* docs: fix vale issue

* fix: replacing env var name

* feat: first 'launch_mapdl_on_cluster` draft

* feat: added arguments to 'launch_mapdl_on_cluster'.
Added also properties `hostname`, `jobid` and `_mapdl_on_slurm`.

* feat: better error messages. Created 'generate_sbatch_command'.

* refactor: rename 'detect_HPC' to 'detect_hpc'. Introducing 'launch_on_hpc'.

* refactor: move all the functionality to launch_mapdl

* feat: launched is fixed now in 'launcher' silently.

* refactor: using `PYMAPDL_RUNNING_ON_HPC` as env var.
Fixing bugs and tests

* chore: adding changelog file 3497.documentation.md [dependabot-skip]

* refactor: rename to `scheduler_args`

* fix: launching issues

* fix: tests

* docs: formatting changes.

* docs: more cosmetic changes.

* tests: adding 'launch_grpc' testing.

* tests: adding some unit tests

* fix: unit tests

* chore: adding changelog file 3466.documentation.md [dependabot-skip]

* fix: adding missing import

* refactoring: `check_mapdl_launch_on_hpc` and addressing codacity issues

* fix: test

* refactor: exit method. Externalising to _exit_mapdl function.

* fix: not running all tests.

* tests: adding test to __del__.

* refactor: patching exit to avoid raising exception. I need to fix this later better.

* refactor: not asking for version or checking exec_file path if 'launch_on_hpc' is true.

* tests: increasing coverage

* test: adding stack for patching MAPDL launching.

* refactor: to allow more coverage

* feat: avoid checking the underlying processes when running on HPC

* tests: increasing coverage

* chore: adding coverage to default pytesting. Adding _commands for checking coverage.

* fix: remote launcher

* fix: raising exceptions in __del__ method

* fix: weird missing reference (import) when exiting

* chore/making sure we regress to the right state after the tests

* test: fix test

* fix: not checking the mode

* refactor: reorg ip section on init. Adding better str representation to MapdlGrpc

* feat: avoid killing MAPDL if not `finish_job_on_exit`. Adding also a property for `finish_job_on_exit`.

* feat: raising error if specifying IP when `launch_on_hpc`.

* feat: increasing grpc error handling options to 3s or 5 attempts.

* feat: renaming to scheduler_options. Using variable default start_timeout. Raise an exception if scheduler options are given, but not nproc. Fix scontrol call.

* refactor: added types

* refactor: launcher args order

* refactor: tests

* fix: reusing connection attr.

* fix: pass start_timeout to `get_job_info`.

* fix: test

* fix: test

* tests: not requiring warning if on minimal since ATP is not present.

* feat: simplifying directory property

* feat: using cached version of directory.

* feat: simplifying directory property

* chore: adding changelog file 3517.miscellaneous.md [dependabot-skip]

* test: adding test

* feat: caching directory in cwd

* refactor: mapdl patcher

* feat: caching directory in cwd

* feat: caching directory for sure.

* feat: caching dir at the cwd level.

* feat: retry mechanism inside /INQUIRE

* feat: changing exception message

* feat: adding tests

* feat: caching directory

* chore: adding changelog file 3517.added.md [dependabot-skip]

* refactor: avoid else in while.

* refactor: using a temporary variable to avoid overwrite self._path
Raise error if empty response only  if non_interactive mode.

* fix: not keeping state between tests

* fix: making sure the state is reset between tests

* fix: warning when exiting.

* fix: test

* feat: using a trimmed version for delete.

* refactor: test to pass

* refactor: removing all cleaning from __del__ except ending HPC job.

* refactor: changing `detect_hpc` with `running_on_hpc`.
Simplifying `launch_mapdl_on_cluster`.

* docs: adding-sbatch-support (#3513)

* docs: expanding a bit the `PyMAPDL on HPC clusters` section

* docs: adding info about launching MAPDL in HPC.

* chore: adding changelog file 3513.documentation.md [dependabot-skip]

* fix: vale issues

* docs: changing the name to `scheduler_options`.
Add warning about adding nproc.

* fix: vale issues

* docs: apply suggestions from Kathy code review

Co-authored-by: Kathy Pippert <[email protected]>

* docs: adding CPUs.

---------

Co-authored-by: pyansys-ci-bot <[email protected]>
Co-authored-by: Kathy Pippert <[email protected]>

* feat: avoid exceptions on `__del__`

* tests: adding tests for get_port and get_ip

* feat: using a submitter function for grouping.

* tests: attempting clean exit

* feat: externalising to function getting the batchhost

* tests: increasing coverage

* tests: fix

* fix: doc builds

* tests: increasing coverage

* fix: not passing args

* tests: increase coverage

* fix: tests

* fix: fixture

* ci: uploading bandit reports as artifact.

* docs: adding descriptor to phrase

---------

Co-authored-by: pyansys-ci-bot <[email protected]>
Co-authored-by: Kathy Pippert <[email protected]>
3 people authored Oct 29, 2024
1 parent e970359 commit 0640963
Showing 18 changed files with 1,987 additions and 377 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/ci.yml
@@ -147,6 +147,7 @@ jobs:
token: ${{ secrets.PYANSYS_CI_BOT_TOKEN }}
python-package-name: ${{ env.PACKAGE_NAME }}
dev-mode: ${{ github.ref != 'refs/heads/main' }}
upload-reports: True

docs-build:
name: "Build documentation"
@@ -774,6 +775,7 @@ jobs:
env:
ON_LOCAL: true
ON_UBUNTU: true
TESTING_MINIMAL: true

steps:
- name: "Install Git and checkout project"
1 change: 1 addition & 0 deletions doc/changelog.d/3497.documentation.md
@@ -0,0 +1 @@
feat: support for launching an MAPDL instance in an SLURM HPC cluster
1 change: 1 addition & 0 deletions doc/changelog.d/3513.documentation.md
@@ -0,0 +1 @@
docs: adding-sbatch-support
229 changes: 229 additions & 0 deletions doc/source/user_guide/hpc/launch_mapdl_entrypoint.rst
@@ -0,0 +1,229 @@

Interactive MAPDL instance launched from the login node
=======================================================

Starting the instance
---------------------

If you are already logged into a login node, you can launch an MAPDL instance as a SLURM job and
connect to it.
To accomplish this, run these commands on the login node:

.. code:: pycon

    >>> from ansys.mapdl.core import launch_mapdl
    >>> mapdl = launch_mapdl(launch_on_hpc=True)

PyMAPDL submits a job to the scheduler using the appropriate commands.
In the case of SLURM, it uses the ``sbatch`` command with the ``--wrap`` argument
to pass the command line that starts MAPDL.
Other scheduler arguments can be specified using the ``scheduler_options``
argument as a Python :class:`dict`:

.. code:: pycon

    >>> from ansys.mapdl.core import launch_mapdl
    >>> scheduler_options = {"nodes": 10, "ntasks-per-node": 2}
    >>> mapdl = launch_mapdl(launch_on_hpc=True, nproc=20, scheduler_options=scheduler_options)

.. note::

    PyMAPDL cannot infer the number of CPUs that you are requesting from the scheduler.
    Hence, you must specify this value using the ``nproc`` argument.

The double minus (``--``) prefix that is common in the long version of some
scheduler arguments is added automatically if PyMAPDL detects it is missing
and the specified argument is more than one character long.
For instance, the ``ntasks-per-node`` argument is submitted as ``--ntasks-per-node``.

Alternatively, you can submit the scheduler options as a single Python string (:class:`str`):

.. code:: pycon

    >>> from ansys.mapdl.core import launch_mapdl
    >>> scheduler_options = "-N 10"
    >>> mapdl = launch_mapdl(launch_on_hpc=True, scheduler_options=scheduler_options)

.. warning::

    Because PyMAPDL is already using the ``--wrap`` argument, this argument
    cannot be used again.

    The value of each scheduler argument is wrapped in single quotes (``'``).
    This might cause parsing issues and can make the job fail after a
    successful submission.

PyMAPDL passes all the environment variables of the
user to the new job and to the MAPDL instance.
This is usually convenient because many environment variables are
needed to run the job or the MAPDL command.
For instance, the license server is normally stored in the :envvar:`ANSYSLMD_LICENSE_FILE` environment variable.
If you prefer not to pass these environment variables to the job, use the SLURM
``--export`` argument to specify the desired environment variables.
For more information, see `SLURM documentation <slurm_docs_>`_.
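
For instance, a minimal sketch that forwards only the license server variable
through ``scheduler_options`` (the variable list is illustrative; check which
variables your cluster and MAPDL installation actually need):

.. code:: pycon

    >>> # Illustrative: forward only the license server variable to the job
    >>> scheduler_options = {"export": "ANSYSLMD_LICENSE_FILE"}
    >>> mapdl = launch_mapdl(launch_on_hpc=True, nproc=4, scheduler_options=scheduler_options)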


Working with the instance
-------------------------

Once the :class:`Mapdl <ansys.mapdl.core.mapdl.MapdlBase>` object has been created,
it does not differ from a normal :class:`Mapdl <ansys.mapdl.core.mapdl.MapdlBase>`
instance.
You can retrieve the IP of the MAPDL instance as well as its hostname:

.. code:: pycon

    >>> mapdl.ip
    '123.45.67.89'
    >>> mapdl.hostname
    'node0'

You can also retrieve the SLURM job ID:

.. code:: pycon

    >>> mapdl.jobid
    10001

If you want to check whether the instance has been launched using a scheduler,
you can use the :attr:`mapdl_on_hpc <ansys.mapdl.core.mapdl_grpc.MapdlGrpc.mapdl_on_hpc>`
attribute:

.. code:: pycon

    >>> mapdl.mapdl_on_hpc
    True

Sharing files
^^^^^^^^^^^^^

Most HPC clusters share the login node filesystem with the compute nodes,
which means that you do not need extra work to upload files to or download files
from the MAPDL instance. You only need to copy them to the location where MAPDL is running.
You can obtain this location with the
:attr:`directory <ansys.mapdl.core.mapdl_grpc.MapdlGrpc.directory>` attribute.

If no location is specified in the :func:`launch_mapdl() <ansys.mapdl.core.launcher.launch_mapdl>`
function, then a temporary location is selected.
It is a good idea to set the ``run_location`` argument to a directory that is accessible
from all the compute nodes.
Normally anything under ``/home/user`` is available to all compute nodes.
If you are unsure where you should launch MAPDL, contact your cluster administrator.

Additionally, you can use the :meth:`upload <ansys.mapdl.core.mapdl_grpc.MapdlGrpc.upload>`
and :meth:`download <ansys.mapdl.core.mapdl_grpc.MapdlGrpc.download>` methods to
upload and download files to and from the MAPDL instance, respectively.
You do not need ``ssh`` or a similar connection.
However, for large files, you might want to consider alternatives.
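
For example, a short sketch of both methods (the file names are illustrative):

.. code:: pycon

    >>> # Copy a local input file to the MAPDL working directory
    >>> mapdl.upload("input_deck.inp")
    >>> # Retrieve a result file from the MAPDL working directory
    >>> mapdl.download("file.rst")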


Exiting MAPDL
-------------

Exiting MAPDL, either intentionally or unintentionally, stops the job.
This behavior occurs because MAPDL is the main process of the job. Thus, when it
finishes, the scheduler considers the job done.

To exit MAPDL, you can use the :meth:`exit() <ansys.mapdl.core.Mapdl.exit>` method.
This method exits MAPDL and sends a signal to the scheduler to cancel the job.

.. code-block:: python

    mapdl.exit()

When the Python process that runs PyMAPDL finishes without errors and you have not
called the :meth:`exit() <ansys.mapdl.core.Mapdl.exit>` method, the garbage collector
kills the MAPDL instance and its job. This is intended to save resources.

If you prefer that the job is not killed, set the following attribute in the
:class:`Mapdl <ansys.mapdl.core.mapdl.MapdlBase>` class:

.. code-block:: python

    mapdl.finish_job_on_exit = False

In this case, you should set a time limit on your job to avoid having it
run longer than needed.
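
For example, a sketch that combines this attribute with a SLURM wall-time limit
(the ``time`` value is illustrative):

.. code:: pycon

    >>> # Illustrative: cap the job at one hour with the SLURM --time option
    >>> scheduler_options = {"time": "01:00:00"}
    >>> mapdl = launch_mapdl(launch_on_hpc=True, nproc=4, scheduler_options=scheduler_options)
    >>> mapdl.finish_job_on_exit = False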


Handling crashes on an HPC
^^^^^^^^^^^^^^^^^^^^^^^^^^

If MAPDL crashes while running on an HPC cluster, the job finishes right away.
In this case, PyMAPDL is disconnected from MAPDL.
PyMAPDL tries to reconnect to the MAPDL instance up to 5 times, waiting
for up to 5 seconds.
If unsuccessful, you might get an error like this:

.. code-block:: text

    MAPDL server connection terminated unexpectedly while running:
      /INQUIRE,,DIRECTORY,,
    called by:
      _send_command

    Suggestions:
      MAPDL *might* have died because it executed a not-allowed command or ran out of memory.
      Check the MAPDL command output for more details.
      Open an issue on GitHub if you need assistance: https://github.com/ansys/pymapdl/issues

    Error:
      failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)

    Full error:
      <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)"
        debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-10-24T08:25:04.054559811+00:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50052: Failed to connect to remote host: connect: Connection refused (111)"}"
      >
The data of that job is available in the :attr:`directory <ansys.mapdl.core.Mapdl.directory>` attribute.
For this reason, you should set the run location using the ``run_location`` argument.

While handling this exception, PyMAPDL also cancels the job to avoid leaking resources.
Therefore, the only option is to start a new instance by launching a new job using
the :func:`launch_mapdl() <ansys.mapdl.core.launcher.launch_mapdl>` function.
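
For example, a relaunch sketch (the ``run_location`` path is illustrative):

.. code:: pycon

    >>> # Illustrative: relaunch in a known, shared location so files survive crashes
    >>> mapdl = launch_mapdl(
    ...     launch_on_hpc=True, nproc=10, run_location="/home/user/mapdl_jobs"
    ... )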

Use case on a SLURM cluster
---------------------------

Assume a user wants to start a remote MAPDL instance on an HPC cluster
to interact with it.
The user would like to request 10 nodes and 1 task per node (to avoid clashes
between MAPDL instances), as well as 64 GB of RAM.
Because of administration logistics, the user must use the machines in
the ``supercluster01`` partition.
To make PyMAPDL launch an instance like that on SLURM, run the following code:

.. code-block:: python

    from ansys.mapdl.core import launch_mapdl
    from ansys.mapdl.core.examples import vmfiles

    scheduler_options = {
        "nodes": 10,
        "ntasks-per-node": 1,
        "partition": "supercluster01",
        "memory": 64,
    }

    mapdl = launch_mapdl(launch_on_hpc=True, nproc=10, scheduler_options=scheduler_options)

    num_cpu = mapdl.get_value("ACTIVE", 0, "NUMCPU")  # It should be equal to 10

    mapdl.clear()  # Not strictly needed.
    mapdl.prep7()

    # Run an MAPDL script
    mapdl.input(vmfiles["vm1"])

    # Let's solve again to get the solve printout
    mapdl.solution()
    output = mapdl.solve()
    print(output)

    mapdl.exit()  # Kill the MAPDL instance

PyMAPDL automatically sets MAPDL to read the job configuration (including machines,
number of CPUs, and memory), which allows MAPDL to use all the resources allocated
to that job.
62 changes: 47 additions & 15 deletions doc/source/user_guide/hpc/pymapdl.rst
@@ -19,35 +19,34 @@ on whether or not you run them both on the HPC compute nodes.
Additionally, you might be able to interact with them (``interactive`` mode)
or not (``batch`` mode).

For information on supported configurations, see :ref:`ref_pymapdl_batch_in_cluster_hpc`.
PyMAPDL takes advantage of HPC clusters to launch MAPDL instances
with increased resources.
PyMAPDL automatically sets these MAPDL instances to read the
scheduler job configuration (which includes machines, number
of CPUs, and memory), which allows MAPDL to use all the resources
allocated to that job.
For more information, see :ref:`ref_tight_integration_hpc`.

The following configurations are supported:

Since v0.68.5, PyMAPDL can take advantage of the tight integration
between the scheduler and MAPDL to read the job configuration and
launch an MAPDL instance that can use all the resources allocated
to that job.
For instance, if a SLURM job has allocated 8 nodes with 4 cores each,
then PyMAPDL launches an MAPDL instance which uses 32 cores
spawning across those 8 nodes.
This behavior can turn off if passing the :envvar:`PYMAPDL_ON_SLURM`
environment variable or passing the ``detect_HPC=False`` argument
to the :func:`launch_mapdl() <ansys.mapdl.core.launcher.launch_mapdl>` function.
* :ref:`ref_pymapdl_batch_in_cluster_hpc`.
* :ref:`ref_pymapdl_interactive_in_cluster_hpc_from_login`


.. _ref_pymapdl_batch_in_cluster_hpc:

Submit a PyMAPDL batch job to the cluster from the entrypoint node
==================================================================
Batch job submission from the login node
========================================

Many HPC clusters allow their users to log into a machine using
``ssh``, ``vnc``, ``rdp``, or similar technologies and then submit a job
to the cluster from there.
This entrypoint machine, sometimes known as the *head node* or *entrypoint node*,
This login machine, sometimes known as the *head node* or *entrypoint node*,
might be a virtual machine (VDI/VM).

In such cases, once the Python virtual environment with PyMAPDL is already
set and is accessible to all the compute nodes, launching a
PyMAPDL job from the entrypoint node is very easy to do using the ``sbatch`` command.
PyMAPDL job from the login node is very easy to do using the ``sbatch`` command.
When the ``sbatch`` command is used, PyMAPDL runs and launches an MAPDL instance in
the compute nodes.
No changes are needed in a PyMAPDL script to run it on a SLURM cluster.
@@ -98,6 +97,8 @@ job by setting the :envvar:`PYMAPDL_NPROC` environment variable to the desired v
(venv) user@entrypoint-machine:~$ PYMAPDL_NPROC=4 sbatch main.py
For more applicable environment variables, see :ref:`ref_environment_variables`.

You can also add ``sbatch`` options to the command:

.. code-block:: console
@@ -181,3 +182,34 @@ This bash script performs tasks such as creating environment variables,
moving files to different directories, and printing to ensure your
configuration is correct.


.. _ref_pymapdl_interactive_in_cluster_hpc:


.. _ref_pymapdl_interactive_in_cluster_hpc_from_login:

.. include:: launch_mapdl_entrypoint.rst


.. _ref_tight_integration_hpc:

Tight integration between MAPDL and the HPC scheduler
=====================================================

Since v0.68.5, PyMAPDL can take advantage of the tight integration
between the scheduler and MAPDL to read the job configuration and
launch an MAPDL instance that can use all the resources allocated
to that job.
For instance, if a SLURM job has allocated 8 nodes with 4 cores each,
then PyMAPDL launches an MAPDL instance that uses 32 cores
spanning across those 8 nodes.

This behavior can be turned off by setting the
:envvar:`PYMAPDL_RUNNING_ON_HPC` environment variable
to ``'false'`` or by passing the ``detect_hpc=False`` argument
to the :func:`launch_mapdl() <ansys.mapdl.core.launcher.launch_mapdl>` function.

Alternatively, you can override these settings by either specifying
custom settings in the :func:`launch_mapdl() <ansys.mapdl.core.launcher.launch_mapdl>`
function's arguments or using specific environment variables.
For more information, see :ref:`ref_environment_variables`.
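
For instance, a minimal sketch that disables the detection from Python before
launching (assuming the environment variable is read at launch time):

.. code:: pycon

    >>> # Illustrative: disable HPC detection for this session
    >>> import os
    >>> os.environ["PYMAPDL_RUNNING_ON_HPC"] = "false"
    >>> from ansys.mapdl.core import launch_mapdl
    >>> mapdl = launch_mapdl()
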
3 changes: 2 additions & 1 deletion doc/source/user_guide/mapdl.rst
@@ -1092,6 +1092,7 @@ are unsupported.
| * ``LSWRITE`` | |:white_check_mark:| Available (Internally running in :attr:`Mapdl.non_interactive <ansys.mapdl.core.Mapdl.non_interactive>`) | |:white_check_mark:| Available | |:exclamation:| Only in :attr:`Mapdl.non_interactive <ansys.mapdl.core.Mapdl.non_interactive>` | |
+---------------+---------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+

.. _ref_environment_variables:

Environment variables
=====================
@@ -1189,7 +1190,7 @@ environment variable. The following table describes all arguments.
| | user@machine:~$ export PYMAPDL_MAPDL_VERSION=22.2 |
| | |
+---------------------------------------+----------------------------------------------------------------------------------+
| :envvar:`PYMAPDL_ON_SLURM` | With this environment variable set to ``FALSE``, you can avoid |
| :envvar:`PYMAPDL_RUNNING_ON_HPC` | With this environment variable set to ``FALSE``, you can prevent |
| | PyMAPDL from detecting that it is running on a SLURM HPC cluster. |
+---------------------------------------+----------------------------------------------------------------------------------+
| :envvar:`PYMAPDL_MAX_MESSAGE_LENGTH` | Maximum gRPC message length. If your |
1 change: 1 addition & 0 deletions doc/styles/config/vocabularies/ANSYS/accept.txt
@@ -53,6 +53,7 @@ CentOS7
Chao
ci
container_layout
CPUs
datas
delet
Dependabot
2 changes: 0 additions & 2 deletions pyproject.toml
@@ -148,8 +148,6 @@ src_paths = ["doc", "src", "tests"]
[tool.coverage.run]
source = ["ansys/pymapdl"]
omit = [
# omit commands
"ansys/mapdl/core/_commands/*",
# ignore legacy interfaces
"ansys/mapdl/core/mapdl_console.py",
"ansys/mapdl/core/jupyter.py",
6 changes: 3 additions & 3 deletions src/ansys/mapdl/core/errors.py
@@ -307,9 +307,9 @@ def wrapper(*args, **kwargs):
old_handler = signal.signal(signal.SIGINT, handler)

# Capture gRPC exceptions
n_attempts = 3
initial_backoff = 0.05
multiplier_backoff = 3
n_attempts = 5
initial_backoff = 0.1
multiplier_backoff = 2

i_attemps = 0

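
For context, here is a minimal sketch (not the actual PyMAPDL wrapper) of the
exponential-backoff retry pattern that these constants configure. With the new
values, the sketch sleeps 0.1 s, 0.2 s, 0.4 s, and 0.8 s between its five attempts:

.. code-block:: python

    import time


    def retry_with_backoff(func, n_attempts=5, initial_backoff=0.1, multiplier_backoff=2):
        """Call ``func``, retrying with exponentially growing waits between attempts."""
        backoff = initial_backoff
        for attempt in range(n_attempts):
            try:
                return func()
            except ConnectionError:  # stand-in for the gRPC error the real wrapper handles
                if attempt == n_attempts - 1:
                    raise  # no attempts left, so propagate the failure
                time.sleep(backoff)
                backoff *= multiplier_backoff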
