pdx.nb.akash.pub - providers spewing thousands of pods; nvidia-smi Unable to determine the device handle for GPU0000:A1:00.0: Unknown Error
#209
I've also asked Netdata to add an alert for the "GPU has fallen off the bus" message.
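As an interim check until such an alert exists, here is a minimal sketch of scanning the kernel log for the same symptom. It assumes a Python environment on the node with `journalctl` access; the patterns, time window, and script itself are illustrative, not Netdata's actual alert configuration.

```python
import re
import subprocess

# Symptoms seen on node1.pdx.nb: NVIDIA Xid 79 ("GPU has fallen off the bus")
# and nvidia-smi losing the device handle. Patterns are illustrative.
PATTERNS = [
    r"Xid.*:\s*79",
    r"GPU has fallen off the bus",
]

def gpu_fell_off_bus() -> bool:
    """Return True if the recent kernel log shows a fallen-off-the-bus event."""
    log = subprocess.run(
        ["journalctl", "-k", "--no-pager", "--since", "1 hour ago"],
        capture_output=True, text=True, check=False,
    ).stdout
    return any(re.search(p, log) for p in PATTERNS)

if __name__ == "__main__":
    if gpu_fell_off_bus():
        print("ALERT: GPU fell off the bus - check nvidia-smi / dmesg on this node")
```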
The GPU is back after a reboot.

One deployment can't seem to spawn; I've asked the owner to redeploy that dseq.

TODO
The issue has reoccurred a third time.
I've cordoned node1.pdx.nb so it won't participate in the provider's resource scheduling until the GPU issue gets fixed by the provider.
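For context, cordoning marks the node unschedulable in Kubernetes so no new pods (and therefore no new leases) land on it; the usual command is `kubectl cordon node1.pdx.nb`. Below is a minimal sketch of the same operation through the API, assuming the Python `kubernetes` client and the node name as written above:

```python
from kubernetes import client, config

# API-level equivalent of `kubectl cordon node1.pdx.nb`:
# set spec.unschedulable so the scheduler stops placing new pods on the node.
config.load_kube_config()
core = client.CoreV1Api()
core.patch_node("node1.pdx.nb", {"spec": {"unschedulable": True}})
```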
Bug report submitted with the GPU crash dump data => https://forums.developer.nvidia.com/t/xid-79-error-gpu-falls-off-bus-with-nvidia-driver-535-161-07-on-ubuntu-22-04-lts-server/288976
NebulaBlock is going to replace the node1.pdx.nb.akash.pub server from 9:30am to 11:30am PT in order to fix the 4090 GPU issue. I've scaled the akash-provider service down until that work is complete. https://discord.com/channels/747885925232672829/1111749348351553587/1227292077369589842
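Equivalently, a sketch of the scale-down via the Kubernetes API, assuming the provider runs as a Deployment named `akash-provider` in an `akash-services` namespace (the workload kind, name, and namespace are assumptions here; adjust to the actual chart):

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def scale_provider(replicas: int) -> None:
    # Assumed workload: Deployment "akash-provider" in namespace "akash-services".
    apps.patch_namespaced_deployment_scale(
        name="akash-provider",
        namespace="akash-services",
        body={"spec": {"replicas": replicas}},
    )

scale_provider(0)    # stop bidding before the 9:30-11:30am PT maintenance window
# scale_provider(1)  # bring the provider back once node1.pdx is replaced and verified
```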
The node1.pdx.nb.akash.pub server (mainboard) has been successfully replaced; the 8x 4090 GPUs, the 1x 1.75T disk (used for Ceph), and the 2x 7T RAID1 disks (rootfs) were kept. Good news: rook-ceph (Akash's persistent storage) picked up the 1.75T disk on the new node1.pdx correctly and is currently copying the replicas (PGs) to it. I've updated the NVIDIA ticket. Will reopen this issue if it reoccurs.
Reason: GPU issue on node1

Nodes: node1, node2, node3