Issue: Job Stuck on p4d Compute Node
Required Information
Cluster name: test-cluster
Region: eu-west-1
Cluster configuration: omitted (contains sensitive information)
Bug Description
When attempting to run a job on any p4d compute node, the job becomes stuck in the Slurm queue, remaining in a pending state until it times out and retries. This happens even when no special boot scripts are configured. I also did not see anything unusual in the CloudWatch dashboard logs or on the machine itself.
Observation: The compute node launches successfully, with no immediate errors or unusual logs observed during startup.
Workaround: Replacing the p4d.24xlarge instance type with g5.24xlarge resolves the issue, indicating the problem may be specific to p4d instances.
Potential Cause
This issue may be connected to the increased configuration times introduced in version 3.11.1, as reported in GitHub issue #6479. The longer setup duration might affect the readiness or responsiveness of p4d compute nodes in Slurm, causing jobs to remain in the CF (configuring) state.
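For context, the stuck state can be observed from the head node with standard Slurm commands; these are generic commands, not taken from the original report, and the node name is a placeholder:

    # On the head node, while the job is stuck:
    squeue                         # the job's ST column shows CF (configuring)
    sinfo -N -l                    # a dynamic node being launched typically carries a power-up suffix, e.g. alloc#
    scontrol show node <nodename>  # inspect the State and Reason fields for the affected p4d node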
Temporary Fix
Downgrading to AWS ParallelCluster version 3.10.1 resolves the issue, allowing jobs to run on p4d instances without getting stuck. This suggests that the issue may be related to changes introduced in version 3.11.1.
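For anyone wanting to try the same temporary fix, the downgrade amounts to reinstalling the CLI and recreating the cluster. This is only a minimal sketch; the configuration file name is a placeholder, and the cluster name is taken from the report above:

    pip3 install "aws-parallelcluster==3.10.1"
    pcluster version    # confirm the CLI now reports 3.10.1
    pcluster delete-cluster --cluster-name test-cluster
    pcluster create-cluster --cluster-name test-cluster --cluster-configuration cluster-config.yaml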
Steps to Reproduce
Configure a cluster with a p4d compute node (the actual YAML configuration was omitted; an illustrative sketch follows these steps).
Attempt to submit a job to the p4d node.
Observe that the job remains in the Slurm queue, ultimately timing out and retrying without successfully executing.
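Since the actual cluster configuration was omitted, the following is only an illustrative minimal ParallelCluster 3.x configuration with a p4d queue; the subnet IDs, key name, OS, and instance counts are placeholder assumptions, not the values from the affected cluster:

    Region: eu-west-1
    Image:
      Os: alinux2
    HeadNode:
      InstanceType: c5.xlarge
      Networking:
        SubnetId: subnet-xxxxxxxx
      Ssh:
        KeyName: my-key
    Scheduling:
      Scheduler: slurm
      SlurmQueues:
        - Name: gpu-queue
          ComputeResources:
            - Name: p4d
              InstanceType: p4d.24xlarge
              MinCount: 0
              MaxCount: 1
          Networking:
            SubnetIds:
              - subnet-xxxxxxxx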
Expected Behavior
The job should execute on the p4d compute node without getting stuck in CF status.
Request
Are there known issues with p4d instances on AWS ParallelCluster version 3.11.1, or specific configurations required to support job execution on p4d nodes?
QuintenSchrevens changed the title from "p4d instance not able to spin-up with pcluster 3.11.1" to "p4d instance not able to run job with pcluster 3.11.1" on Nov 6, 2024.
We are taking a look at it and will post an update here soon.
We observed that downgrading the NVIDIA drivers to version 535.183.01 solves the problem.
Would this be a viable solution for you?