The current limit of 512MiB set for akash-inventory-operator might be too small: https://github.com/akash-network/helm-charts/blob/provider-8.0.3/charts/akash-inventory-operator/templates/deployment.yaml#L36
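For reference, the limit in question can also be read straight from the live Deployment (a quick check, assuming the Deployment is named akash-inventory-operator, as the pod names below suggest):

# print the current memory limit of the inventory operator container
$ kubectl -n akash-services get deployment akash-inventory-operator \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'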
$ kubectl -n akash-services get pods
NAME                                        READY   STATUS    RESTARTS      AGE
akash-hostname-operator-6795445db-jf46g     1/1     Running   0             27m
akash-inventory-operator-75d7758b86-kqk6s   1/1     Running   4 (68s ago)   29m
akash-node-1-0                              1/1     Running   0             48m
akash-provider-0                            1/1     Running   0             26m
root@node3:~# dmesg -T -l alert -l crit -l emerg -l err
...
[Tue Feb 20 19:10:18 2024] Memory cgroup out of memory: Killed process 80061 (provider-servic) total-vm:5025880kB, anon-rss:517808kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1620kB oom_score_adj:999
[Tue Feb 20 19:17:12 2024] Memory cgroup out of memory: Killed process 125599 (provider-servic) total-vm:5173344kB, anon-rss:516256kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1636kB oom_score_adj:999
[Tue Feb 20 19:17:12 2024] Memory cgroup out of memory: Killed process 125701 (provider-servic) total-vm:5173344kB, anon-rss:516256kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1636kB oom_score_adj:999
[Tue Feb 20 19:24:38 2024] Memory cgroup out of memory: Killed process 166260 (provider-servic) total-vm:5322344kB, anon-rss:517112kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1660kB oom_score_adj:999
[Tue Feb 20 19:24:38 2024] Memory cgroup out of memory: Killed process 166287 (provider-servic) total-vm:5322344kB, anon-rss:517112kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1660kB oom_score_adj:999
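The dmesg entries above come from the node itself; the same OOM kills can be confirmed from the Kubernetes side by looking at the container's last termination state (a sketch using the pod name from the listing above; the reason is expected to read OOMKilled):

# show why the container was last terminated (expected: OOMKilled)
$ kubectl -n akash-services get pod akash-inventory-operator-75d7758b86-kqk6s \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'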
Env

provider: sg.lneq
$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                        IMAGE
akash-hostname-operator-6795445db-jf46g     ghcr.io/akash-network/provider:0.4.8
akash-inventory-operator-544c75d855-qs8lh   ghcr.io/akash-network/provider:0.4.8
akash-node-1-0                              ghcr.io/akash-network/node:0.30.0
akash-provider-0                            ghcr.io/akash-network/provider:0.4.8
$ kubectl get nodes -o wide
NAME    STATUS   ROLES           AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node1   Ready    control-plane   71m   v1.28.6   10.74.43.129   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node2   Ready    control-plane   71m   v1.28.6   10.74.43.133   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node3   Ready    <none>          69m   v1.28.6   10.74.43.131   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node4   Ready    <none>          69m   v1.28.6   10.8.68.129    <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph status
  cluster:
    id:     69d6af8d-3dfa-47cd-8f6e-bcbc5320987f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 42m)
    mgr: b(active, since 39m), standbys: a
    osd: 8 osds: 8 up (since 40m), 8 in (since 40m)

  data:
    pools:   2 pools, 257 pgs
    objects: 7 objects, 577 KiB
    usage:   4.9 GiB used, 28 TiB / 28 TiB avail
    pgs:     257 active+clean

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         27.94470  root default
-5          6.98618      host node1
 2   nvme   3.49309          osd.2       up   1.00000  1.00000
 5   nvme   3.49309          osd.5       up   1.00000  1.00000
-9          6.98618      host node2
 6   nvme   3.49309          osd.6       up   1.00000  1.00000
 7   nvme   3.49309          osd.7       up   1.00000  1.00000
-3          6.98618      host node3
 1   nvme   3.49309          osd.1       up   1.00000  1.00000
 4   nvme   3.49309          osd.4       up   1.00000  1.00000
-7          6.98618      host node4
 0   nvme   3.49309          osd.0       up   1.00000  1.00000
 3   nvme   3.49309          osd.3       up   1.00000  1.00000
Observation: 3 vs 4 nodes

Interestingly, comparing this provider sg.lneq to sg.lnlm, the latter doesn't experience this issue:

provider: sg.lnlm
$ kubectl -n akash-services get pods
NAME                                        READY   STATUS    RESTARTS   AGE
akash-hostname-operator-6795445db-5xhq5     1/1     Running   0          5d7h
akash-inventory-operator-75d7758b86-gh2wj   1/1     Running   0          5d6h
akash-node-1-0                              1/1     Running   0          5d7h
akash-provider-0                            1/1     Running   0          21h

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph status
  cluster:
    id:     661a3fe0-5ff2-4575-a421-f812501f463c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 5d)
    mgr: a(active, since 5d), standbys: b
    osd: 6 osds: 6 up (since 5d), 6 in (since 5d)

  data:
    pools:   2 pools, 257 pgs
    objects: 491 objects, 889 MiB
    usage:   9.8 GiB used, 5.2 TiB / 5.2 TiB avail
    pgs:     257 active+clean

  io:
    client:   341 B/s wr, 0 op/s rd, 0 op/s wr

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         5.23975  root default
-3         1.74658      host node1
 0   nvme  0.87329          osd.0       up   1.00000  1.00000
 2   nvme  0.87329          osd.2       up   1.00000  1.00000
-5         1.74658      host node2
 1   nvme  0.87329          osd.1       up   1.00000  1.00000
 4   nvme  0.87329          osd.4       up   1.00000  1.00000
-7         1.74658      host node3
 3   nvme  0.87329          osd.3       up   1.00000  1.00000
 5   nvme  0.87329          osd.5       up   1.00000  1.00000

$ kubectl get nodes -o wide
NAME    STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
node1   Ready    control-plane   5d8h   v1.28.6   192.168.0.100   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
node2   Ready    control-plane   5d8h   v1.28.6   192.168.0.101   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
node3   Ready    <none>          5d8h   v1.28.6   192.168.0.102   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
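One way to quantify the difference between the two clusters would be to watch the operator's actual working set and see how close it gets to the 512MiB limit on the 4-node/8-OSD provider versus the 3-node/6-OSD one. A sketch (not part of the original report; requires metrics-server to be installed):

# actual memory usage of the inventory operator container (needs metrics-server)
$ kubectl -n akash-services top pod --containers | grep inventory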
Next steps

I've lifted the RAM limit to 1GiB to see if it helps.
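For anyone hitting the same OOM loop before a fixed chart release is available, the limit can also be bumped in place on the running Deployment (a temporary workaround sketch; the proper fix is raising the default in the akash-inventory-operator chart, and a chart upgrade would revert this change):

# temporarily raise the memory limit on the live deployment
$ kubectl -n akash-services set resources deployment akash-inventory-operator \
    --limits=memory=1Gi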
Referenced commits:

- 609b219: fix(operator/inventory): raise memlimit (fixes akash-network/support#185)
- 6e4392d: fix(operator/inventory): raise memlimit (#248)