openmpi/slurm multi-node jobs fail silently when prted is not in path #2129

Open
markuskowa opened this issue Feb 2, 2025 · 1 comment
Comments

@markuskowa

Problem

When a PMIx multi-node job is run on a Slurm cluster, mpirun fails silently if prted is not in the search path and cannot be found by other means.
This may not be a common scenario, but it arises easily when running MPI-related software via the Nix package manager. The problem was discovered and solved here: NixOS/nixpkgs#377616.

Getting an error message that reveals the problem (i.e. prted not found) requires increasing the verbosity level of the plm modules via --prtemca plm_base_verbose 10.
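For anyone scripting their launches, the following is a minimal sketch of how the verbosity option could be prepended to an existing mpirun command line. The helper name `with_plm_verbosity` and the application arguments (`-n 4 ./my_app`) are hypothetical and only for illustration; the `--prtemca plm_base_verbose` option is the one quoted above.

```python
import shlex


def with_plm_verbosity(mpirun_args, level=10):
    """Return an mpirun argv with the plm verbosity option prepended.

    `mpirun_args` is whatever would normally follow `mpirun`
    (hypothetical example: ["-n", "4", "./my_app"]).
    """
    if level < 1:
        # Verbosity has to be above zero for the launch error to show up.
        raise ValueError("level must be above zero")
    return ["mpirun", "--prtemca", "plm_base_verbose", str(level), *mpirun_args]


# Build (but do not run) the command line for a hypothetical application.
cmd = with_plm_verbosity(["-n", "4", "./my_app"])
print(shlex.join(cmd))
```

The sketch only assembles the argument list, so it can be handed to `subprocess.run` or pasted into a batch script as needed.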

Expected behavior

An error such as prted: No such file or directory should be displayed at the default verbosity level.
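Until such a message exists at the default verbosity level, a wrapper script could run a pre-flight check along these lines. This is a rough sketch, not part of Open MPI or PRRTE: the function name `check_prted` is hypothetical, and it only inspects the search path of the node it runs on, whereas the failure described here may occur on the remote nodes, whose environment can differ.

```python
import shutil
import sys


def check_prted(search_path=None):
    """Look for the prted executable and report if it is missing.

    `search_path` is an os.pathsep-separated directory list; when it is
    None, shutil.which falls back to the PATH environment variable.
    Returns the full path to prted, or None if it was not found.
    """
    found = shutil.which("prted", path=search_path)
    if found is None:
        # Mirror the message this issue asks for.
        print("prted: No such file or directory", file=sys.stderr)
    return found
```

Calling `check_prted()` before invoking mpirun would at least catch the case where the launching environment itself lacks prted.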

Versions

  • OS: NixOS-24.11
  • prrte-3.0.8
  • openmpi-5.0.6
@rhc54
Contributor

rhc54 commented Feb 3, 2025

The error message you cite is coming from Slurm, not PRRTE. We tie off stdout and stderr for the srun used to start the remote daemons for scaling purposes, so we aren't going to change that setting. The suggested steps for debugging launch are in the documentation and start with enabling stderr/stdout using --debug-daemons, or --debug, or setting the plm verbosity to anything above zero.

We should not, however, fail silently. It is supposed to output a "daemons failed to launch" message, so I'll take a peek at that one.
