provider stops responding over 8443/status, 8444 sporadically (either immediately after start or after some time) #190

Closed
andy108369 opened this issue Mar 7, 2024 · 4 comments
Labels
P2 repo/provider Akash provider-services repo issues

Comments

@andy108369
Contributor

Since it was upgraded from 0.4.8 to 0.5.4, the Hurricane provider sporadically stops responding on 8443/status and 8444 (either immediately after start or after some time).

NOTE: AKASH_IP_OPERATOR=false and the akash ip operator helm chart is not present (normally IP Leasing would be enabled, but I disabled it because I initially thought it was causing the problem).

nvidia-device-plugin-0.14.5     0.14.5
akash-node-9.0.0                0.30.0
provider-9.1.0                  0.5.4
akash-hostname-operator-9.0.5   0.5.4
akash-inventory-operator-9.0.5  0.5.4
ingress-nginx-4.10.0            1.10.0
rook-ceph-v1.12.4               v1.12.4
rook-ceph-cluster-v1.12.4       v1.12.4

Logs

Workarounds

I've implemented an automatic restart of the provider pod when the livenessProbe finds it cannot get data from 8443/status, etc.
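For anyone wanting to try a similar workaround, here is a minimal sketch of the kind of check a livenessProbe could run. It is not the exact probe used on this provider: the `check_status` helper name, the `127.0.0.1` address, the 10-second default timeout, and the use of `-k` (which assumes a self-signed certificate on 8443) are all assumptions for illustration. It only covers 8443/status, not 8444.

```shell
# Hypothetical helper (not from the provider chart): returns 0 when the
# status endpoint answers with a successful HTTP response within the timeout,
# non-zero otherwise.
check_status() {
  url="$1"            # e.g. https://127.0.0.1:8443/status (assumed address)
  timeout="${2:-10}"  # assumed default timeout in seconds
  # -k: skip certificate verification (assumes a self-signed cert)
  # -s: silent, -f: treat HTTP errors as failures, -o /dev/null: discard body
  curl -k -s -f --max-time "$timeout" -o /dev/null "$url"
}
```

Used as an exec-style probe command, something like `check_status https://127.0.0.1:8443/status || exit 1` would make the probe fail whenever the endpoint hangs or errors, and Kubernetes would then restart the container once the probe's failure threshold is reached.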

Will keep monitoring the akash-provider pod restart count.
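A rough sketch of how the restart count could be read with kubectl. The `restart_count` helper is hypothetical; the `akash-services` namespace is the one used for the other Akash charts in this thread, and the `app=akash-provider` label selector is an assumption that may not match every install.

```shell
# Hypothetical helper: print the total restart count across all containers
# of the pods matched by the selector.
restart_count() {
  ns="${1:-akash-services}"            # namespace (assumed default)
  selector="${2:-app=akash-provider}"  # label selector (assumption)
  total=0
  # jsonpath prints one restartCount per container, separated by spaces
  for n in $(kubectl -n "$ns" get pods -l "$selector" \
      -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'); do
    total=$((total + n))
  done
  echo "$total"
}
```

Running `restart_count` periodically (or from cron) and comparing the value with the previous one is enough to notice when the livenessProbe has kicked in.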

Additional notes

Since we upgraded the providers from 0.4.8 to 0.5.4, I have observed this issue only on the Hurricane provider.

@andy108369 andy108369 added repo/provider Akash provider-services repo issues awaiting-triage labels Mar 7, 2024
@andy108369
Contributor Author

No restarts or issues since the provider was last started (26 hours of uptime).
I'll let it run like this over the weekend and will then re-enable IP Leasing.

@chainzero
Collaborator

Awaiting further testing by @andy108369 prior to further investigation

@andy108369
Contributor Author

andy108369 commented Mar 23, 2024

Re-enabled IP Leasing:

  1. provider.yaml
     ipoperator: true
  2. installed metallb chart and applied the config
     helm upgrade --install metallb metallb/metallb -n metallb-system --version 0.14.3
     kubectl apply -f metallb-config.yaml
  3. installed akash-ip-operator chart
     helm upgrade --install akash-ip-operator akash/akash-ip-operator -n akash-services --set provider_address=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk

@andy108369
Contributor Author

andy108369 commented Apr 4, 2024

I can no longer see this issue with provider 0.5.11.
Closing.
