
Everything seems to be OK, but it doesn't work (Ubuntu 24.04, Operator v25.3.0) #1398


Open
blumfontein opened this issue Apr 12, 2025 · 2 comments

Comments


blumfontein commented Apr 12, 2025

Hi guys,

I can't understand what is wrong in my case. Everything seems to be OK, but it just doesn't work.
The cluster is running on the latest Rancher k3s.
GPU Operator v25.3.0 was installed from Rancher Apps.

Here are the logs:

kubectl get pods: everything is up and running

:~# kubectl get pods
NAME                                                          READY   STATUS             RESTARTS      AGE
gpu-feature-discovery-vklrj                                   1/1     Running            0             22m
gpu-operator-56977fc4b6-96t6s                                 1/1     Running            0             23m
gpu-operator-node-feature-discovery-gc-78d798587d-7dldq       1/1     Running            0             23m
gpu-operator-node-feature-discovery-master-7b7b57c9f9-5mmmz   1/1     Running            0             23m
gpu-operator-node-feature-discovery-worker-7c2n9              1/1     Running            0             23m
nvidia-container-toolkit-daemonset-gggxj                      1/1     Running            0             22m
nvidia-cuda-validator-ps8fl                                   0/1     Completed          0             21m
nvidia-dcgm-exporter-c7g4l                                    1/1     Running            0             22m
nvidia-device-plugin-daemonset-7wrzw                          1/1     Running            0             22m
nvidia-mig-manager-qcgqd                                      1/1     Running            0             22m
nvidia-operator-validator-srhzw                               1/1     Running            0             22m

kubectl describe node sees the GPUs:

 kubectl describe node XXX

Capacity:
  cpu:                128
  ephemeral-storage:  2079140828Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1031635584Ki
  nvidia.com/gpu:     8
  pods:               110
Allocatable:
  cpu:                128
  ephemeral-storage:  2022588195893
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1031635584Ki
  nvidia.com/gpu:     8
  pods:               110

I am able to see the GPUs in Docker:

 sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

but via the CRI it just does not work.

Here is the example from the documentation:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
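
For reference, the pod that actually failed below is critest, not the cuda-vectoradd sample above. A minimal manifest reconstructed from the describe output that follows (container name, image, and command taken from that output; the restart policy is assumed) would look roughly like this:

apiVersion: v1
kind: Pod
metadata:
  name: critest
spec:
  restartPolicy: Never        # assumed; not shown in the describe output
  containers:
  - name: nvidia-gpu
    image: ubuntu
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
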
:~# kubectl describe pod critest
Name:             critest
Namespace:        default
Priority:         0
Service Account:  default
Node:             localhost.localdomain/149.137.199.173
Start Time:       Sat, 12 Apr 2025 12:16:08 +0000
Labels:           <none>
Annotations:      <none>
Status:           Failed
IP:               10.42.0.185
IPs:
  IP:  10.42.0.185
Containers:
  nvidia-gpu:
    Container ID:  containerd://ade19e45be28c623f8b05923c93dc075d71f87340478a0b6f2501843c75ecd3c
    Image:         ubuntu
    Image ID:      docker.io/library/ubuntu@sha256:1e622c5f073b4f6bfad6632f2616c7f59ef256e96fe78bf6a595d1dc4376ac02
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-smi
    State:          Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "nvidia-smi": executable file not found in $PATH
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 00:00:00 +0000
      Finished:     Sat, 12 Apr 2025 12:16:09 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bjxqh (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-bjxqh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  24m   default-scheduler  Successfully assigned default/critest to localhost.localdomain
  Normal   Pulling    24m   kubelet            Pulling image "ubuntu"
  Normal   Pulled     24m   kubelet            Successfully pulled image "ubuntu" in 353ms (353ms including waiting). Image size: 29727061 bytes.
  Normal   Created    24m   kubelet            Created container: nvidia-gpu
  Warning  Failed     24m   kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "nvidia-smi": executable file not found in $PATH

blumfontein commented Apr 12, 2025

I found the toolkit-validation container from the gpu-operator pod, which outputs a proper nvidia-smi response. I tried to copy its config into my deployment, but no luck.
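
One way to narrow this down (a rough sketch; the validator pod name is taken from the get pods output above) is to compare which RuntimeClass the working validator pod runs with versus the test pod:

kubectl get runtimeclass
kubectl get pod nvidia-operator-validator-srhzw -o jsonpath='{.spec.runtimeClassName}'
kubectl get pod critest -o jsonpath='{.spec.runtimeClassName}'

If the validator reports nvidia and critest reports nothing, the test pod is likely being started with the default (runc) runtime.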

@FourierMourier

Hi! Isn't the pod supposed to use a different RuntimeClass rather than the default one? Or did you edit the Helm chart values / CRI configs?
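
If the operator did not register nvidia as the default containerd runtime, the test pod has to request the nvidia RuntimeClass explicitly. A minimal sketch, assuming the operator created the RuntimeClass with its default name nvidia:

apiVersion: v1
kind: Pod
metadata:
  name: critest
spec:
  runtimeClassName: nvidia    # run this pod with the nvidia container runtime
  restartPolicy: Never
  containers:
  - name: nvidia-gpu
    image: ubuntu
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

Alternatively, the toolkit can be told to make nvidia the default runtime in containerd (e.g. via a toolkit env such as CONTAINERD_SET_AS_DEFAULT in the chart values, if I recall the option correctly), in which case no runtimeClassName is needed.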
