akash-provider: report Pod Exit Code, Restart Count, Time running since last restart
#246
Open
Is your feature request related to a problem? Please describe.
Currently, vllm deployments just exit with a -15 (SIGTERM) exit code, which makes it hard for users to realize that the root cause of the issue is the Pod reaching the maximum memory limit set in the SDL.

The following chain of events happens:
-15 exit code is for SIGTERM
137 exit code at the pod level (128 + 9, i.e. SIGKILL, which is what Kubernetes reports for an OOM-killed container)

Describe the solution you'd like
The provider should read the pod's last Exit Code (and the time it occurred) as well as the Restart Count.
Ideally also report the time the deployment has been running since last restart.
All this info should be obtained via the akash-provider process itself (akash-provider needs to query K8s to obtain this data).
Of course, this data should not be recorded on the blockchain, as that would bloat the chain and result in unnecessary txs/fees.
32m ago):

Describe alternatives you've considered
N/A
Additional context
Unfortunately, lease-events does not report this information.