Description
The current memory limit of 512MiB set for akash-inventory-operator might be too small:
https://github.com/akash-network/helm-charts/blob/provider-8.0.3/charts/akash-inventory-operator/templates/deployment.yaml#L36
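To double-check which limit is actually in effect on the running deployment (the deployment name below is taken from the pod list in this report), something along these lines can be used:

$ # prints the requests/limits currently set on the inventory-operator container
$ kubectl -n akash-services get deployment akash-inventory-operator \
    -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'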
$ kubectl -n akash-services get pods
NAME                                        READY   STATUS    RESTARTS      AGE
akash-hostname-operator-6795445db-jf46g     1/1     Running   0             27m
akash-inventory-operator-75d7758b86-kqk6s   1/1     Running   4 (68s ago)   29m
akash-node-1-0                              1/1     Running   0             48m
akash-provider-0                            1/1     Running   0             26m
root@node3:~# dmesg -T -l alert -l crit -l emerg -l err
...
[Tue Feb 20 19:10:18 2024] Memory cgroup out of memory: Killed process 80061 (provider-servic) total-vm:5025880kB, anon-rss:517808kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1620kB oom_score_adj:999
[Tue Feb 20 19:17:12 2024] Memory cgroup out of memory: Killed process 125599 (provider-servic) total-vm:5173344kB, anon-rss:516256kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1636kB oom_score_adj:999
[Tue Feb 20 19:17:12 2024] Memory cgroup out of memory: Killed process 125701 (provider-servic) total-vm:5173344kB, anon-rss:516256kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1636kB oom_score_adj:999
[Tue Feb 20 19:24:38 2024] Memory cgroup out of memory: Killed process 166260 (provider-servic) total-vm:5322344kB, anon-rss:517112kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1660kB oom_score_adj:999
[Tue Feb 20 19:24:38 2024] Memory cgroup out of memory: Killed process 166287 (provider-servic) total-vm:5322344kB, anon-rss:517112kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1660kB oom_score_adj:999
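The restart count above lines up with these OOM kills. To confirm that the restarts really are OOM terminations (pod name taken from this cluster; adjust as needed), the container's last termination reason can be queried, e.g.:

$ # should print "OOMKilled" if the kernel killed the container
$ kubectl -n akash-services get pod akash-inventory-operator-75d7758b86-kqk6s \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'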
Env
provider: sg.lneq
$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                        IMAGE
akash-hostname-operator-6795445db-jf46g     ghcr.io/akash-network/provider:0.4.8
akash-inventory-operator-544c75d855-qs8lh   ghcr.io/akash-network/provider:0.4.8
akash-node-1-0                              ghcr.io/akash-network/node:0.30.0
akash-provider-0                            ghcr.io/akash-network/provider:0.4.8
$ kubectl get nodes -o wide
NAME    STATUS   ROLES           AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node1   Ready    control-plane   71m   v1.28.6   10.74.43.129   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node2   Ready    control-plane   71m   v1.28.6   10.74.43.133   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node3   Ready    <none>          69m   v1.28.6   10.74.43.131   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node4   Ready    <none>          69m   v1.28.6   10.8.68.129    <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph status
  cluster:
    id:     69d6af8d-3dfa-47cd-8f6e-bcbc5320987f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 42m)
    mgr: b(active, since 39m), standbys: a
    osd: 8 osds: 8 up (since 40m), 8 in (since 40m)

  data:
    pools:   2 pools, 257 pgs
    objects: 7 objects, 577 KiB
    usage:   4.9 GiB used, 28 TiB / 28 TiB avail
    pgs:     257 active+clean
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         27.94470  root default
-5          6.98618      host node1
 2  nvme    3.49309          osd.2       up   1.00000  1.00000
 5  nvme    3.49309          osd.5       up   1.00000  1.00000
-9          6.98618      host node2
 6  nvme    3.49309          osd.6       up   1.00000  1.00000
 7  nvme    3.49309          osd.7       up   1.00000  1.00000
-3          6.98618      host node3
 1  nvme    3.49309          osd.1       up   1.00000  1.00000
 4  nvme    3.49309          osd.4       up   1.00000  1.00000
-7          6.98618      host node4
 0  nvme    3.49309          osd.0       up   1.00000  1.00000
 3  nvme    3.49309          osd.3       up   1.00000  1.00000
Observation: 3 vs 4 nodes
Interestingly, when comparing this provider (sg.lneq) to sg.lnlm, the latter doesn't experience this issue:
$ kubectl -n akash-services get pods
NAME                                        READY   STATUS    RESTARTS   AGE
akash-hostname-operator-6795445db-5xhq5     1/1     Running   0          5d7h
akash-inventory-operator-75d7758b86-gh2wj   1/1     Running   0          5d6h
akash-node-1-0                              1/1     Running   0          5d7h
akash-provider-0                            1/1     Running   0          21h
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph status
  cluster:
    id:     661a3fe0-5ff2-4575-a421-f812501f463c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 5d)
    mgr: a(active, since 5d), standbys: b
    osd: 6 osds: 6 up (since 5d), 6 in (since 5d)

  data:
    pools:   2 pools, 257 pgs
    objects: 491 objects, 889 MiB
    usage:   9.8 GiB used, 5.2 TiB / 5.2 TiB avail
    pgs:     257 active+clean

  io:
    client: 341 B/s wr, 0 op/s rd, 0 op/s wr
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         5.23975  root default
-3         1.74658      host node1
 0  nvme   0.87329          osd.0       up   1.00000  1.00000
 2  nvme   0.87329          osd.2       up   1.00000  1.00000
-5         1.74658      host node2
 1  nvme   0.87329          osd.1       up   1.00000  1.00000
 4  nvme   0.87329          osd.4       up   1.00000  1.00000
-7         1.74658      host node3
 3  nvme   0.87329          osd.3       up   1.00000  1.00000
 5  nvme   0.87329          osd.5       up   1.00000  1.00000
$ kubectl get nodes -o wide
NAME    STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
node1   Ready    control-plane   5d8h   v1.28.6   192.168.0.100   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
node2   Ready    control-plane   5d8h   v1.28.6   192.168.0.101   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
node3   Ready    <none>          5d8h   v1.28.6   192.168.0.102   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
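A plausible explanation is that the inventory operator's memory footprint grows with the number of nodes and OSDs it has to inventory. To watch its actual usage against the 512MiB limit over time (requires metrics-server; pod name from the affected cluster), something like the following can help:

$ # per-container memory/CPU usage of the inventory operator pod
$ kubectl -n akash-services top pod akash-inventory-operator-75d7758b86-kqk6s --containers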
Next steps
I've lifted the RAM limit to 1GiB to see if it helps.
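For reference, one way to apply such a bump in place on a running cluster (until the chart default changes) is kubectl's set resources; a helm upgrade with the corresponding values override would be the cleaner, persistent route, assuming the chart exposes the limit in its values:

$ # raise only the memory limit on the inventory-operator deployment
$ kubectl -n akash-services set resources deployment/akash-inventory-operator \
    --limits=memory=1Gi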