Apply NVIDIA Runtime Engine

NOTE - perform these steps on the control plane node where Helm was installed in the previous step

Create RuntimeClass

Create the NVIDIA Runtime Config

cat > nvidia-runtime-class.yaml << EOF
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
  name: nvidia
handler: nvidia
EOF

Apply the NVIDIA Runtime Config

kubectl apply -f nvidia-runtime-class.yaml
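
Optionally, confirm the RuntimeClass exists before proceeding:

kubectl get runtimeclass nvidia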

Upgrade/Install the NVIDIA Device Plugin Via Helm - GPUs on All Nodes

NOTE - in some scenarios a provider may host GPUs on only a subset of Kubernetes worker nodes. Use the instructions in this section if ALL Kubernetes worker nodes have available GPU resources. If only a subset of worker nodes host GPU resources, use the section Upgrade/Install the NVIDIA Device Plugin Via Helm - GPUs on Subset of Nodes instead. Complete only one of these two sections.
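
If the nvdp Helm repository was not already added as part of the previous Helm installation step, register it first (this uses the standard upstream chart repository URL and may be skipped if the repository is already present):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update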

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.14.5 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=volume-mounts

Expected/Example Output

root@ip-172-31-8-172:~# helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.14.5 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=volume-mounts
  
Release "nvdp" does not exist. Installing it now.
NAME: nvdp
LAST DEPLOYED: Thu Apr 13 19:11:28 2023
NAMESPACE: nvidia-device-plugin
STATUS: deployed
REVISION: 1
TEST SUITE: None
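
Optionally, confirm the device plugin pods are running on the expected worker nodes (each GPU node should host one nvidia-device-plugin pod):

kubectl -n nvidia-device-plugin get pods -o wide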

Upgrade/Install the NVIDIA Device Plugin Via Helm - GPUs on Subset of Nodes

NOTE - use the instructions in this section if only a subset of Kubernetes worker nodes have available GPU resources.

  • By default, the nvidia-device-plugin DaemonSet may run on all nodes in your Kubernetes cluster. If you want to restrict its deployment to only GPU-enabled nodes, you can leverage Kubernetes node labels and selectors.
  • Specifically, you can use the allow-nvdp=true label to limit where the DaemonSet is scheduled.

STEP 1: Label the GPU Nodes

  • First, identify your GPU nodes and label them with allow-nvdp=true. You can do this by running the following command for each GPU node.
  • Replace <node-name> with the name of the node you're labeling.

NOTE - if you are unsure of the <node-name> to use in this command, run kubectl get nodes from one of your Kubernetes control plane nodes and use the value shown in the NAME column of the output

kubectl label nodes <node-name> allow-nvdp=true
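
Optionally, confirm the label was applied by listing only the labeled nodes:

kubectl get nodes -l allow-nvdp=true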

STEP 2: Update Helm Chart Values

  • By setting the node selector, you are ensuring that the nvidia-device-plugin DaemonSet will only be scheduled on nodes with the allow-nvdp=true label.
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.14.5 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=volume-mounts \
  --set-string nodeSelector.allow-nvdp="true"

STEP 3: Verify

kubectl -n nvidia-device-plugin get pods -o wide

Expected/Example Output

  • In this example, only nodes node1, node3, and node4 carry the allow-nvdp=true label, so the nvidia-device-plugin pods are scheduled only on those nodes:
root@node1:~# kubectl -n nvidia-device-plugin get pods -o wide

NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
nvdp-nvidia-device-plugin-gqnm2   1/1     Running   0          11s   10.233.75.1   node2   <none>           <none>

Verification - Applicable to all Environments

kubectl -n nvidia-device-plugin logs -l app.kubernetes.io/instance=nvdp

Expected/Example Output

 root@node1:~# kubectl -n nvidia-device-plugin logs -l app.kubernetes.io/instance=nvdp
  "sharing": {
    "timeSlicing": {}
  }
}
2023/04/14 14:18:27 Retreiving plugins.
2023/04/14 14:18:27 Detected NVML platform: found NVML library
2023/04/14 14:18:27 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/04/14 14:18:27 Starting GRPC server for 'nvidia.com/gpu'
2023/04/14 14:18:27 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/04/14 14:18:27 Registered device plugin for 'nvidia.com/gpu' with Kubelet
  "sharing": {
    "timeSlicing": {}
  }
}
2023/04/14 14:18:29 Retreiving plugins.
2023/04/14 14:18:29 Detected NVML platform: found NVML library
2023/04/14 14:18:29 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/04/14 14:18:29 Starting GRPC server for 'nvidia.com/gpu'
2023/04/14 14:18:29 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/04/14 14:18:29 Registered device plugin for 'nvidia.com/gpu' with Kubelet
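
As an optional end-to-end check, a short-lived test pod that requests a GPU and runs nvidia-smi can confirm that the RuntimeClass and device plugin work together. This is a minimal sketch and not part of the original procedure; the pod name and CUDA image tag are illustrative and may need to be adjusted for your environment:

cat > gpu-test-pod.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
  - name: gpu-test
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

kubectl apply -f gpu-test-pod.yaml

Once the pod completes, its logs should show the nvidia-smi device table; the pod can then be deleted:

kubectl logs gpu-test
kubectl delete -f gpu-test-pod.yaml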