inventory-operator: doesn't detect when nvdp-nvidia-device-plugin marks GPU as unhealthy #249
chainzero added the repo/provider (Akash provider-services repo issues) label and removed the awaiting-triage label on Aug 21, 2024
This issue appears to be resolved by

Additional note: the issue originally cited improper reporting via the status endpoint. With the fixes in this RC, that endpoint also reports correct allocatable/available GPU numbers. As we would expect, it is now returning valid numbers as well, such as:
troian added a commit to akash-network/provider that referenced this issue on Dec 4, 2024:
k8s node capabilities.allocatable may go to 0 when devices exposed via a device plugin become unavailable due to various issues. Recalculate the current node's inventory if any of the fields in capabilities.allocatable changes.

refs akash-network/support#249

Signed-off-by: Artur Troian <[email protected]>
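The fix described in the commit message hinges on diffing the node's allocatable resources between snapshots and recalculating inventory when anything changes. A minimal sketch of that change-detection step is below; the function name, the map-of-quantities representation, and the sample resource names are assumptions for illustration, not the provider's actual code.

```go
package main

import "fmt"

// allocatableChanged reports whether any resource quantity in the node's
// allocatable set differs between two snapshots. Resource names such as
// "nvidia.com/gpu" follow Kubernetes extended-resource naming; the
// comparison itself is a plain map diff (hypothetical helper).
func allocatableChanged(prev, curr map[string]int64) bool {
	if len(prev) != len(curr) {
		return true // a resource appeared or disappeared entirely
	}
	for name, qty := range prev {
		if curr[name] != qty {
			return true // quantity changed, e.g. GPUs dropped to 0
		}
	}
	return false
}

func main() {
	before := map[string]int64{"cpu": 64, "memory": 512 << 30, "nvidia.com/gpu": 8}
	after := map[string]int64{"cpu": 64, "memory": 512 << 30, "nvidia.com/gpu": 0}

	fmt.Println(allocatableChanged(before, before)) // false
	fmt.Println(allocatableChanged(before, after))  // true: GPU count dropped to 0
}
```

When this returns true, the operator would rebuild its view of the node's inventory rather than serving the stale counts, which is what the referenced commit addresses.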
Logs: https://gist.github.com/andy108369/cac9f968f1c6a3eb7c6e92135b8afd42

Querying the 8443/status endpoint reported all 8 GPUs as available, even though at least one had been marked unhealthy.

Occasionally you can recover from this error by bouncing the nvdp-nvidia-device-plugin pod on the node where the GPU was marked unhealthy. But the point is that inventory-operator should ideally detect this, as otherwise GPU deployments will be stuck in "Pending" until all 8 GPUs become available again:
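The manual workaround mentioned above can be sketched with kubectl. The namespace, node name, and pod name here are assumptions for illustration — adjust them for your cluster — and the jsonpath check simply reads the node's allocatable GPU count, which is the figure the device plugin re-registers.

```shell
# Assumed namespace and node name; adjust for your cluster.
# 1. Find the device-plugin pod on the affected node.
kubectl -n nvidia-device-plugin get pods -o wide | grep node1

# 2. Bounce it; the DaemonSet recreates the pod, re-registering the GPUs.
kubectl -n nvidia-device-plugin delete pod nvdp-nvidia-device-plugin-xxxxx

# 3. Verify the node's allocatable GPU count recovered (expect 8 here).
kubectl get node node1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```

This only clears the symptom on one node; as the issue argues, inventory-operator should notice the drop in allocatable GPUs on its own.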