Problem
When a PMIx multi-node job is run on a Slurm cluster, mpirun fails silently if prted is not in the search path and cannot be found by other means.
This may not be a common scenario, but it is easily created when running MPI-related software via the Nix package manager. The problem was discovered and solved here: NixOS/nixpkgs#377616.
To get an error message that reveals the problem (i.e. prted not found), one has to increase the verbosity level of the plm modules via --prtemca plm_base_verbose 10.
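For anyone scripting this, here is a minimal sketch of how the verbose invocation could be assembled. The helper name `verbose_mpirun_cmd` and the application `./my_app` are hypothetical placeholders; only the `--prtemca plm_base_verbose 10` option is taken from this report. The command is only built and printed, not executed.

```python
import shlex


def verbose_mpirun_cmd(app_args, verbosity=10):
    """Build an mpirun argv list with the plm verbosity raised.

    Hypothetical helper for illustration; it does not launch anything.
    """
    if verbosity < 1:
        # Verbosity 0 is the default level at which the failure stays silent.
        raise ValueError("verbosity must be at least 1 to see launch errors")
    return ["mpirun", "--prtemca", "plm_base_verbose", str(verbosity), *app_args]


if __name__ == "__main__":
    # './my_app' is a placeholder for the actual MPI program.
    print(shlex.join(verbose_mpirun_cmd(["./my_app"])))
```

Building the argument list separately from running it makes it easy to log the exact command used when reproducing the problem.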
Expected behavior
An error such as prted: No such file or directory should be displayed at the default verbosity level.
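Until such a message exists, a job script could approximate the expected behavior with a preflight check. The following is only a sketch under assumptions: the function names `find_prted` and `preflight` are hypothetical, this is not PRRTE code, and it inspects only the search path of the node it runs on, so a remote node with a different environment could still fail.

```python
import shutil
import sys


def find_prted(path=None):
    """Return the resolved location of prted, or None if it is not found.

    `path` is an optional search path string; None means use $PATH.
    """
    return shutil.which("prted", path=path)


def preflight(path=None):
    """Print the error this report asks for when prted is missing.

    Returns 0 if prted was found and 1 otherwise (hypothetical convention).
    """
    if find_prted(path) is None:
        print("prted: No such file or directory", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(preflight())
```

Running this before mpirun at least surfaces the missing daemon binary on the launching node at the default verbosity level.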
Versions
OS: NixOS-24.11
prrte-3.0.8
openmpi-5.0.6
The error message you cite is coming from Slurm, not PRRTE. We tie off stdout and stderr for the srun used to start the remote daemons for scaling purposes, so we aren't going to change that setting. The suggested steps for debugging launch are in the documentation and start with enabling stderr/stdout using --debug-daemons, or --debug, or setting the plm verbosity to anything above zero.
We should not, however, fail silently. We are supposed to output a "daemons failed to launch" message, so I'll take a look at that.