Commit 328df5d

himani2411 and Himani Deshpande authored
Add NVSwitch device ID for p5.48xlarge instance and test gpu_health_check for multi-gpu instances which require nvidia fabric manager to be enabled. (#2431)
Co-authored-by: Himani Deshpande <[email protected]>
1 parent 5891b3a commit 328df5d

3 files changed: 10 additions, 4 deletions

cookbooks/aws-parallelcluster-platform/kitchen.platform-config.yml

Lines changed: 2 additions & 0 deletions
@@ -78,12 +78,14 @@ suites:
  - 'resource:package { "package_name": "dkms" }'
  - resource:build_tools
  - recipe:aws-parallelcluster-platform::nvidia_install
+ # - resource:fabric_manager:configure # Needed for Multi-gpu instance like p5.48xlarge
  resource: gdrcopy:configure
  cluster:
  nvidia:
  enabled: true
  driver:
  instance_type: g4dn.2xlarge
+ # instance_type: p5.48xlarge
  - name: intel_hpc
  run_list:
  - recipe[aws-parallelcluster-tests::setup]

cookbooks/aws-parallelcluster-platform/resources/fabric_manager/partial/_fabric_manager_common.rb

Lines changed: 6 additions & 4 deletions
@@ -63,8 +63,10 @@ def _nvidia_driver_version

  # Get number of nv switches
  def get_nvswitches
-   # NVSwitch device id is 10de:1af1
-   nvswitch_check = Mixlib::ShellOut.new("lspci -d 10de:1af1 | wc -l")
-   nvswitch_check.run_command
-   nvswitch_check.stdout.strip.to_i
+   # A100 (P4) and H100(P5) systems have NVSwitches
+   # NVSwitch device id is 10de:1af1 for P4 instance
+   # NVSwitch device id is 10de:22a3 for P5 instance
+   nvswitch_check_p4 = shell_out("lspci -d 10de:1af1 | wc -l")
+   nvswitch_check_p5 = shell_out("lspci -d 10de:22a3 | wc -l")
+   nvswitch_check_p4.stdout.strip.to_i + nvswitch_check_p5.stdout.strip.to_i
  end
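
The counting logic in get_nvswitches can be illustrated without access to a GPU instance. The following is a minimal sketch, not code from this commit: it assumes the PCI listing is already available as text in the numeric `lspci -n` style, and the names NVSWITCH_DEVICE_IDS and count_nvswitches are hypothetical. Only the two device IDs (10de:1af1 and 10de:22a3) come from the diff above; the sample listing is made up for illustration.

```ruby
# Hypothetical helper, not part of the commit: counts NVSwitch devices in
# text that looks like `lspci -n` output, using the two vendor:device IDs
# named in the diff (10de:1af1 for P4, 10de:22a3 for P5).
NVSWITCH_DEVICE_IDS = %w[10de:1af1 10de:22a3].freeze

def count_nvswitches(lspci_output, device_ids = NVSWITCH_DEVICE_IDS)
  # Count every line that mentions at least one of the known NVSwitch IDs.
  lspci_output.each_line.count do |line|
    device_ids.any? { |id| line.downcase.include?(id) }
  end
end

# Illustrative, made-up listing: one non-NVSwitch device and two P5-style NVSwitches.
sample = <<~LSPCI
  00:1e.0 0302: 10de:2330 (rev a1)
  00:1f.0 0680: 10de:22a3 (rev a1)
  00:20.0 0680: 10de:22a3 (rev a1)
LSPCI

puts count_nvswitches(sample) # prints 2
```

Keeping the IDs in one list means a further NVSwitch generation would need only a new list entry, whereas the committed version adds one shell_out call per ID and sums the results. Both approaches give the same total as long as no device matches more than one ID.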

cookbooks/aws-parallelcluster-slurm/kitchen.slurm-config.yml

Lines changed: 2 additions & 0 deletions
@@ -84,10 +84,12 @@ suites:
  - /gpu_health_check_execution/
  driver:
  instance_type: g4dn.xlarge
+ # instance_type: p5.48xlarge
  attributes:
  dependencies:
  - recipe:aws-parallelcluster-slurm::mock_slurm
  - resource:node_attributes
+ # - resource:fabric_manager:configure # Needed for Multi-gpu instance like p5.48xlarge
  cluster:
  node_type: HeadNode
  scheduler: 'slurm'
