Akash Provider incorrectly removes deployments during RPC communication issues; improve resilience to intermittent RPC failures #17
Comments
interesting.. there are absolutely no resources related to
and yet, the provider keeps reporting every few seconds:
(have bounced the hostname-operator too)
oh, ignore it.
I've tried running that using my key:
and it produced this line, even though I don't have any deployments on that provider:
ns (namespace) can be derived manually:
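The original derivation snippet did not survive the export. As a rough illustration only (the function name and the exact hash/encoding are my assumptions, not the provider's API; the observed ingress names suggest Akash uses a custom alphabet, so consult the provider source, e.g. its lease-to-namespace builder, for the real algorithm), the idea is that the K8s namespace is deterministically derived from the lease ID fields:

```python
import base64
import hashlib

def lease_namespace(owner: str, dseq: int, gseq: int, oseq: int) -> str:
    """Hypothetical sketch: hash the lease ID fields and encode the digest
    as a lowercase, unpadded base32 string to get a DNS-safe namespace.
    This is NOT the provider's actual algorithm; it only illustrates that
    the namespace is derivable from the lease ID alone."""
    path = f"{owner}/{dseq}/{gseq}/{oseq}"
    digest = hashlib.sha256(path.encode()).digest()
    return base64.b32encode(digest).decode().rstrip("=").lower()
```

Because the derivation is deterministic, the same lease ID always maps to the same namespace, which is what lets it be reconstructed manually when debugging.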
I've
TL;DR
Issue Overview: This issue reoccurred on the Hurricane provider, running akash-provider
A total of 44 deployments (worth noting, not all of them) were closed simultaneously. For detailed analysis, let's focus on dseq
Relevant Logs:
Pod Restart Analysis:
Error Context:
Logs by Restart Timing:
Preliminary Conclusion: The root cause appears to stem from the cascading effects of:
Further investigation is required to determine:
|
0.1.0
removed the deployment after lease query failed ... err=(MISSING)
I can confirm my earlier assumption: there was indeed an intermittent issue with the RPC. Upon searching for relevant error messages, I found RPC node sync check failed in the provider logs.
@troian The Akash Provider should not assume deployments are absent simply because a lease query failed or the RPC node itself is having issues. It should be designed to handle intermittent communication or RPC disruptions more resiliently, without removing the existing deployments from K8s.
The same issue occurred again on the Hurricane provider :/
Side note: I've opened a discussion about what might be a potential contributor (not the root cause) to this issue: https://github.com/orgs/akash-network/discussions/760
praetor-based provider-services:
0.1.0
provider address:
akash1tweev0k42guyv3a2jtgphmgfrl2h5y2884vh9d
A provider owner (SGC#3172) reported that the j0asgbmq1a6p4s7ii0tlvuoco.ingress.dcnorse.ddns.net ingress host resource (for the DSEQ 9562948) started to return 404, and other deployments disappeared from his K8s cluster.
What's interesting is that I can see providerleasedips CRD does not exist messages in the provider logs, which is part of the praetor-based provider starting script check (you can see it below). This means there is some condition where the K8s cluster does not seem to be fully initialized, causing the provider to think it has no leases (lease query failed ... err=(MISSING)), leading to the lease removal. It does not, however, close them on the blockchain: bid, lease, order, and deployment stay in the active/open state, so the leases are still active from the blockchain's point of view.
Provider logs
See more complete logs for the past 7 days (90Mi) here => https://transfer.sh/Fg7vTc/logs.txt
After about 3 hours from the provider start, the following lines started to appear in the logs:
grep MISSING logs.txt
Additional information