
p4d instance not able to run job with pcluster 3.11.1 #6549

Open
QuintenSchrevens opened this issue Nov 6, 2024 · 2 comments

Comments

QuintenSchrevens commented Nov 6, 2024

Issue: Job Stuck on p4d Compute Node

Required Information

  • AWS ParallelCluster Version: 3.11.1
  • Cluster Name: test-cluster
  • Region: eu-west-1

Cluster Configuration (Sensitive information omitted)

HeadNode:
  InstanceType: c5.large
  Networking:
    SubnetId: subnet-xxxxxxxxxx
    AdditionalSecurityGroups:
      - sg-xxxxxxxxxxxxxx
  LocalStorage:
    RootVolume:
      VolumeType: gp3
      Size: 200
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - Policy: arn:aws:iam::aws:policy/AmazonS3FullAccess
      - Policy: arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
      - Policy: arn:aws:iam::xxxxxxxxxxxxxxxxxxx

Scheduling:
  Scheduler: slurm
  SlurmSettings:
    MungeKeySecretArn: xxxxxxxxxxxxx
  SlurmQueues:
    - Name: a100
      ComputeResources:
        - Name: p4d
          Instances:
            - InstanceType: p4d.24xlarge
          MinCount: 0
          MaxCount: 5
      Networking:
        SubnetIds:
          - subnet-xxxxxxxxxxxx
        AdditionalSecurityGroups:
          - sg-xxxxxxxxxxxxx
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
            Size: 200
      Iam:
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess

Image:
  Os: alinux2023
  CustomAmi: ami-xxxxxxxxxxxxx
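
For reference, the cluster was deployed with the standard ParallelCluster CLI (the configuration file name below is illustrative):

  # Deploy the cluster from the configuration above
  pcluster create-cluster \
    --cluster-name test-cluster \
    --cluster-configuration cluster-config.yaml \
    --region eu-west-1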

Bug Description

When attempting to run a job on any p4d compute node, the job gets stuck in the Slurm queue, remaining in a pending state until it times out and retries. This happens even when no special boot scripts are configured. I also did not see anything unusual in the CloudWatch dashboard logs or on the machine itself (the log locations I checked are sketched after the list below).

  • Observation: The compute node launches successfully, with no immediate errors or unusual logs observed during startup.
  • Workaround: Replacing the p4d.24xlarge instance type with g5.24xlarge resolves the issue, indicating the problem may be specific to p4d instances.
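
For reference, a minimal sketch of the head-node logs that are typically relevant here; the paths below assume the ParallelCluster 3.x defaults and may differ on a custom AMI:

  # On the head node: Slurm controller and ParallelCluster node-management logs
  sudo tail -n 100 /var/log/slurmctld.log
  sudo tail -n 100 /var/log/parallelcluster/clustermgtd
  # Launch/bootstrap activity for dynamically powered-up compute nodes
  sudo tail -n 100 /var/log/parallelcluster/slurm_resume.log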

Potential Cause

This issue may be connected to the increased configuration times introduced in version 3.11.1, as reported in GitHub issue #6479. The longer setup duration might affect the readiness or responsiveness of p4d compute nodes in Slurm, causing jobs to remain in the CF state.
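
A quick way to confirm that the job is sitting in the configuring state while the node powers up (standard Slurm tooling on the head node, shown here as a sketch):

  # Job state: CF/CONFIGURING means resources are allocated but not yet ready
  squeue -o "%.10i %.12P %.10T %.30R"
  # Node-oriented view of the partition, including nodes stuck powering up
  sinfo -N -l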

Temporary Fix

Downgrading to AWS ParallelCluster version 3.10.1 resolves the issue, allowing jobs to run on p4d instances without getting stuck. This suggests that the issue may be related to changes introduced in version 3.11.1.
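
For anyone applying the same workaround, the downgrade itself is just a pip install of the older CLI; the cluster then has to be recreated with that version (sketch, assuming a pip-based installation of the pcluster CLI):

  # Install the older CLI and confirm the version before recreating the cluster
  pip3 install --upgrade "aws-parallelcluster==3.10.1"
  pcluster version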

Steps to Reproduce

  1. Configure a cluster with a p4d compute node, as shown in the provided YAML configuration.
  2. Submit a job to the a100 queue, which uses the p4d compute resource (a minimal example is sketched after this list).
  3. Observe that the job remains in the Slurm queue, ultimately timing out and retrying without successfully executing.
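
A minimal submission of the kind used in step 2 (the wrapped command is illustrative; any job targeting the a100 partition shows the same behavior):

  # Submit a trivial job to the a100 partition (p4d compute resource)
  sbatch --partition=a100 --nodes=1 --wrap "nvidia-smi"
  # Watch the job: it stays pending/CF until the timeout and retry
  squeue -u $USER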

Expected Behavior

The job should execute on the p4d compute node without getting stuck in the CF (configuring) state.

Request

Are there known issues with p4d instances on AWS ParallelCluster version 3.11.1, or are there specific configurations required to support job execution on p4d nodes?

QuintenSchrevens changed the title from "p4d instance not able to spin-up with pcluster 3.11.1" to "p4d instance not able to run job with pcluster 3.11.1" on Nov 6, 2024
gmarciani (Contributor) commented

Hi @QuintenSchrevens,
thank you for reporting this problem.

We are taking a look at it and will post an update here soon.
We observed that downgrading the NVIDIA drivers to version 535.183.01 solves the problem.
Would this be a viable solution for you?

Thank you.
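
For reference, a quick way to check which driver version a compute node is actually running (the downgrade procedure itself depends on how the driver was installed in the custom AMI, so treat this only as a verification sketch):

  # On a p4d compute node: report the loaded NVIDIA driver version
  nvidia-smi --query-gpu=driver_version --format=csv,noheader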

gmarciani (Contributor) commented

Issue and mitigation: #6571
