Add NVSwitch device ID for p6 instance type #2987

Status: Open (1 commit, base branch: develop)

Conversation

himani2411 (Contributor) commented Jul 3, 2025

Description of changes

  • Add the NVSwitch device ID for the p6 instance type, because NVIDIA Fabric Manager must be enabled for GPU health checks to be invoked.

Steps for finding the device ID: https://nvidia.custhelp.com/app/answers/detail/a_id/2040/~/identifying-the-graphics-card-model-and-device-id-in-a-pc

Tests

  • Launched a cluster with the p4d instance type and checked the following log line (the logging statement has since been removed):
[2025-07-03T18:27:07+00:00] INFO: NVSwitch works 6

    * service[nvidia-fabricmanager] action start[2025-07-03T18:27:07+00:00] INFO: Processing service[nvidia-fabricmanager] action start (aws-parallelcluster-platform::test line 36)
 (up to date)

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

nvswitch_check_p4 = shell_out("lspci -d 10de:1af1 | wc -l")
nvswitch_check_p5 = shell_out("lspci -d 10de:22a3 | wc -l")
nvswitch_check_p4.stdout.strip.to_i + nvswitch_check_p5.stdout.strip.to_i
# NVSwitch device id is 10de:2901 for P6 instance

Contributor

Where did we take the device id 10de:2901 from?
Is there some public reference that we can link to the PR?

Contributor Author

nvswitch_check_p4.stdout.strip.to_i + nvswitch_check_p5.stdout.strip.to_i
# NVSwitch device id is 10de:2901 for P6 instance
nvswitch_device_ids = ['10de:1af1', '10de:22a3', '10de:2901']
nvswitch_device_ids.sum { |id| shell_out("lspci -d #{id} | wc -l").stdout.strip.to_i }

Contributor

Why are we summing the switch counts across all device IDs rather than returning the count for the specific instance type?

Contributor Author

These device IDs depend on the GPU being used, so the solution is independent of the instance type: we check the device IDs of the GPUs that we know have NVSwitches.
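
The reasoning in this reply can be sketched in plain Ruby. This is only an illustration, not the cookbook's code: `count_nvswitches`, `NVSWITCH_DEVICE_IDS`, and the injectable `runner` are hypothetical names introduced here so the logic can run without `lspci` or Chef's `shell_out`. The device IDs are the three from the diff, and the count of 6 is taken from the p4d log line in the description.

```ruby
# Hypothetical sketch, not the cookbook's implementation.
# Device IDs as listed in the diff (P4, P5, and P6 NVSwitches).
NVSWITCH_DEVICE_IDS = ['10de:1af1', '10de:22a3', '10de:2901'].freeze

# The runner is an assumption for illustration: it takes a shell command and
# returns its stdout. By default it really executes the command.
def count_nvswitches(device_ids = NVSWITCH_DEVICE_IDS, runner: ->(cmd) { `#{cmd}` })
  # Each lspci query counts devices matching one ID; IDs that are absent on
  # the host contribute 0, so the sum is the count for whichever ID is present.
  device_ids.sum { |id| runner.call("lspci -d #{id} | wc -l").strip.to_i }
end

# Simulated p4d host: only the 10de:1af1 query finds devices.
fake_p4d = ->(cmd) { cmd.include?('10de:1af1') ? "6\n" : "0\n" }
puts count_nvswitches(runner: fake_p4d) # prints 6
```

Since only matching IDs contribute a nonzero count, the sum needs no instance-type lookup, and supporting a new GPU generation only means appending its device ID to the list.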

@@ -54,10 +54,10 @@ def _nvidia_driver_version

# Get number of nv switches
def get_nvswitches

Contributor

@gmarciani gmarciani Jul 3, 2025

Can we cover this change within the fabric manager spec test?

Contributor Author

I will try to cover this function with a unit test.
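
One possible shape for such a test, as a minimal sketch in plain Ruby rather than the project's actual spec setup (which would more likely stub `shell_out` through its own test framework). `NvswitchHelper`, `FakeNode`, and `FakeResult` are hypothetical names; the body of `get_nvswitches` is assumed from the diff in this PR, and the count of 4 is an arbitrary test value, not a real P6 figure.

```ruby
# Hypothetical stand-in for the object returned by shell_out;
# only the stdout attribute is modeled.
FakeResult = Struct.new(:stdout)

# Assumed shape of the helper under test, based on the diff in this PR.
module NvswitchHelper
  def get_nvswitches
    nvswitch_device_ids = ['10de:1af1', '10de:22a3', '10de:2901']
    nvswitch_device_ids.sum { |id| shell_out("lspci -d #{id} | wc -l").stdout.strip.to_i }
  end
end

# Test double: records each command and returns canned per-ID counts.
class FakeNode
  include NvswitchHelper
  attr_reader :commands

  def initialize(counts)
    @counts = counts
    @commands = []
  end

  def shell_out(cmd)
    @commands << cmd
    id = cmd[/lspci -d (\S+)/, 1] # extract the vendor:device ID from the command
    FakeResult.new("#{@counts.fetch(id, 0)}\n")
  end
end

# Simulated host where only the new P6 device ID matches (4 is arbitrary).
p6_node = FakeNode.new({ '10de:2901' => 4 })
raise 'unexpected NVSwitch count' unless p6_node.get_nvswitches == 4
raise 'expected one lspci query per device ID' unless p6_node.commands.length == 3
```

Overriding `shell_out` on a small double keeps the test independent of the hardware it runs on, which matters here because the new device ID only exists on P6 hosts.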

@himani2411 himani2411 force-pushed the nvdia-fabric-manager branch from 5a4a61b to 07dfcb0 Compare July 3, 2025 20:24
@himani2411 himani2411 force-pushed the nvdia-fabric-manager branch from 07dfcb0 to 1e9a845 Compare July 3, 2025 20:59
2 participants