pdx.nb.akash.pub - providers spewing thousands of pods; nvidia-smi Unable to determine the device handle for GPU0000:A1:00.0: Unknown Error
#209
I've also asked Netdata to add an alert for the "GPU has fallen off the bus" message.
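As an interim check until such an alert exists, here is a minimal sketch of scanning the kernel log for the same symptom. It assumes a Python environment on the node with `journalctl` access; the patterns, time window, and script itself are illustrative, not Netdata's actual alert configuration.

```python
import re
import subprocess

# Symptoms seen on node1.pdx.nb: NVIDIA Xid 79 ("GPU has fallen off the bus")
# and nvidia-smi losing the device handle. Patterns are illustrative.
PATTERNS = [
    r"Xid.*:\s*79",
    r"GPU has fallen off the bus",
]

def gpu_fell_off_bus() -> bool:
    """Return True if the recent kernel log shows a fallen-off-the-bus event."""
    log = subprocess.run(
        ["journalctl", "-k", "--no-pager", "--since", "1 hour ago"],
        capture_output=True, text=True, check=False,
    ).stdout
    return any(re.search(p, log) for p in PATTERNS)

if __name__ == "__main__":
    if gpu_fell_off_bus():
        print("ALERT: GPU fell off the bus - check nvidia-smi / dmesg on this node")
```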
The GPU is back after a reboot.

One deployment can't seem to spawn; I've asked the owner to redeploy that dseq.

TODO
The issue has reoccurred a third time.
I've cordoned node1.pdx.nb so it won't participate in the provider's resource scheduling until the GPU issue gets fixed by the provider.
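For context, cordoning marks the node unschedulable in Kubernetes so no new pods (and therefore no new leases) land on it; the usual command is `kubectl cordon node1.pdx.nb`. Below is a minimal sketch of the same operation through the API, assuming the Python `kubernetes` client and the node name as written above:

```python
from kubernetes import client, config

# API-level equivalent of `kubectl cordon node1.pdx.nb`:
# set spec.unschedulable so the scheduler stops placing new pods on the node.
config.load_kube_config()
core = client.CoreV1Api()
core.patch_node("node1.pdx.nb", {"spec": {"unschedulable": True}})
```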
Bug report submitted with the GPU crash dump data => https://forums.developer.nvidia.com/t/xid-79-error-gpu-falls-off-bus-with-nvidia-driver-535-161-07-on-ubuntu-22-04-lts-server/288976
NebulaBlock is going to replace the node1.pdx.nb.akash.pub server from 9:30am to 11:30am PT in order to fix the 4090 GPU issue. I've scaled the akash-provider service down until that work is complete. https://discord.com/channels/747885925232672829/1111749348351553587/1227292077369589842
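Equivalently, a sketch of the scale-down via the Kubernetes API, assuming the provider runs as a Deployment named `akash-provider` in an `akash-services` namespace (the workload kind, name, and namespace are assumptions here; adjust to the actual chart):

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def scale_provider(replicas: int) -> None:
    # Assumed workload: Deployment "akash-provider" in namespace "akash-services".
    apps.patch_namespaced_deployment_scale(
        name="akash-provider",
        namespace="akash-services",
        body={"spec": {"replicas": replicas}},
    )

scale_provider(0)    # stop bidding before the 9:30-11:30am PT maintenance window
# scale_provider(1)  # bring the provider back once node1.pdx is replaced and verified
```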
The node1.pdx.nb.akash.pub server (mainboard) has been successfully replaced; the 8x 4090 GPUs, the 1x 1.75T disk (used for Ceph), and the 2x 7T RAID1 disks (rootfs) were kept. Good news: rook-ceph (Akash's persistent storage) picked up the 1.75T disk on the new node1.pdx correctly and is currently copying the replicas (PGs) to it. I've updated the NVIDIA ticket. Will reopen this issue if it reoccurs.
Reason: GPU issue on node1

Nodes: node1, node2, node3