
p4d instance not able to run job with pcluster 3.11.1 #6549

Open
QuintenSchrevens opened this issue Nov 6, 2024 · 2 comments

Comments

QuintenSchrevens commented Nov 6, 2024

Issue: Job Stuck on p4d Compute Node

Required Information

  • AWS ParallelCluster Version: 3.11.1
  • Cluster Name: test-cluster
  • Region: eu-west-1

Cluster Configuration (Sensitive information omitted)

HeadNode:
  InstanceType: c5.large
  Networking:
    SubnetId: subnet-xxxxxxxxxx
    AdditionalSecurityGroups:
      - sg-xxxxxxxxxxxxxx
  LocalStorage:
    RootVolume:
      VolumeType: gp3
      Size: 200
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - Policy: arn:aws:iam::aws:policy/AmazonS3FullAccess
      - Policy: arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
      - Policy: arn:aws:iam::xxxxxxxxxxxxxxxxxxx

Scheduling:
  Scheduler: slurm
  SlurmSettings:
    MungeKeySecretArn: xxxxxxxxxxxxx
  SlurmQueues:
    - Name: a100
      ComputeResources:
        - Name: p4d
          Instances:
            - InstanceType: p4d.24xlarge
          MinCount: 0
          MaxCount: 5
      Networking:
        SubnetIds:
          - subnet-xxxxxxxxxxxx
        AdditionalSecurityGroups:
          - sg-xxxxxxxxxxxxx
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
            Size: 200
      Iam:
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess

Image:
  Os: alinux2023
  CustomAmi: ami-xxxxxxxxxxxxx
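
For reference, the cluster was deployed with the standard ParallelCluster CLI (the configuration file name below is illustrative):

  # Deploy the cluster from the configuration above
  pcluster create-cluster \
    --cluster-name test-cluster \
    --cluster-configuration cluster-config.yaml \
    --region eu-west-1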

Bug Description

When attempting to run a job on any p4d compute node, the job gets stuck in the Slurm queue, remaining in a pending state until it times out and retries. This happens even when no special boot scripts are configured. I also did not see anything unusual in the CloudWatch dashboard logs or on the machine itself (the log locations I checked are sketched after the list below).

  • Observation: The compute node launches successfully, with no immediate errors or unusual logs observed during startup.
  • Workaround: Replacing the p4d.24xlarge instance type with g5.24xlarge resolves the issue, indicating the problem may be specific to p4d instances.
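
For reference, a minimal sketch of the head-node logs that are typically relevant here; the paths below assume the ParallelCluster 3.x defaults and may differ on a custom AMI:

  # On the head node: Slurm controller and ParallelCluster node-management logs
  sudo tail -n 100 /var/log/slurmctld.log
  sudo tail -n 100 /var/log/parallelcluster/clustermgtd
  # Launch/bootstrap activity for dynamically powered-up compute nodes
  sudo tail -n 100 /var/log/parallelcluster/slurm_resume.log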

Potential Cause

This issue may be connected to the increased configuration times introduced in version 3.11.1, as reported in GitHub issue #6479. The longer setup duration might affect the readiness or responsiveness of p4d compute nodes in Slurm, causing jobs to remain in the CF state.
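
A quick way to confirm that the job is sitting in the configuring state while the node powers up (standard Slurm tooling on the head node, shown here as a sketch):

  # Job state: CF/CONFIGURING means resources are allocated but not yet ready
  squeue -o "%.10i %.12P %.10T %.30R"
  # Node-oriented view of the partition, including nodes stuck powering up
  sinfo -N -l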

Temporary Fix

Downgrading to AWS ParallelCluster version 3.10.1 resolves the issue, allowing jobs to run on p4d instances without getting stuck. This suggests that the issue may be related to changes introduced in version 3.11.1.
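
For anyone applying the same workaround, the downgrade itself is just a pip install of the older CLI; the cluster then has to be recreated with that version (sketch, assuming a pip-based installation of the pcluster CLI):

  # Install the older CLI and confirm the version before recreating the cluster
  pip3 install --upgrade "aws-parallelcluster==3.10.1"
  pcluster version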

Steps to Reproduce

  1. Configure a cluster with a p4d compute node, as shown in the provided YAML configuration.
  2. Submit a job to the a100 queue, which uses the p4d compute resource (a minimal example is sketched after this list).
  3. Observe that the job remains in the Slurm queue, ultimately timing out and retrying without successfully executing.
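
A minimal submission of the kind used in step 2 (the wrapped command is illustrative; any job targeting the a100 partition shows the same behavior):

  # Submit a trivial job to the a100 partition (p4d compute resource)
  sbatch --partition=a100 --nodes=1 --wrap "nvidia-smi"
  # Watch the job: it stays pending/CF until the timeout and retry
  squeue -u $USER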

Expected Behavior

The job should execute on the p4d compute node without getting stuck in the CF (configuring) state.

Request

Are there known issues with p4d instances on AWS ParallelCluster version 3.11.1, or are there specific configurations required to support job execution on p4d nodes?

QuintenSchrevens changed the title from "p4d instance not able to spin-up with pcluster 3.11.1" to "p4d instance not able to run job with pcluster 3.11.1" on Nov 6, 2024
gmarciani (Contributor) commented

Hi @QuintenSchrevens,
thank you for reporting this problem.

We are taking a look at it and will post an update here soon.
We observed that downgrading the NVIDIA drivers to version 535.183.01 solves the problem.
Would this be a viable solution for you?

Thank you.
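
For reference, a quick way to check which driver version a compute node is actually running (the downgrade procedure itself depends on how the driver was installed in the custom AMI, so treat this only as a verification sketch):

  # On a p4d compute node: report the loaded NVIDIA driver version
  nvidia-smi --query-gpu=driver_version --format=csv,noheader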

gmarciani (Contributor) commented

Issue and mitigation: #6571
