[helm-charts] akash-inventory-operator needs more memory #185

Closed
andy108369 opened this issue Feb 20, 2024 · 0 comments · Fixed by akash-network/helm-charts#248
Labels
repo/helm-charts Akash Helm Chart repo issues

andy108369 (Contributor) commented Feb 20, 2024

The current memory limit of 512MiB set for akash-inventory-operator might be too small: the pod keeps getting OOM-killed and restarted (4 restarts within 29 minutes):
https://github.com/akash-network/helm-charts/blob/provider-8.0.3/charts/akash-inventory-operator/templates/deployment.yaml#L36
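
For reference, the limit in effect on the running deployment can be read back with a jsonpath query; this assumes the deployment is named akash-inventory-operator and runs a single container (index 0), matching the pod listing below, and should print the 512Mi from the chart template:

$ kubectl -n akash-services get deployment akash-inventory-operator \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'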

$ kubectl -n akash-services get pods
NAME                                        READY   STATUS    RESTARTS      AGE
akash-hostname-operator-6795445db-jf46g     1/1     Running   0             27m
akash-inventory-operator-75d7758b86-kqk6s   1/1     Running   4 (68s ago)   29m
akash-node-1-0                              1/1     Running   0             48m
akash-provider-0                            1/1     Running   0             26m
root@node3:~# dmesg -T -l alert -l crit -l emerg -l err 
...
[Tue Feb 20 19:10:18 2024] Memory cgroup out of memory: Killed process 80061 (provider-servic) total-vm:5025880kB, anon-rss:517808kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1620kB oom_score_adj:999
[Tue Feb 20 19:17:12 2024] Memory cgroup out of memory: Killed process 125599 (provider-servic) total-vm:5173344kB, anon-rss:516256kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1636kB oom_score_adj:999
[Tue Feb 20 19:17:12 2024] Memory cgroup out of memory: Killed process 125701 (provider-servic) total-vm:5173344kB, anon-rss:516256kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1636kB oom_score_adj:999
[Tue Feb 20 19:24:38 2024] Memory cgroup out of memory: Killed process 166260 (provider-servic) total-vm:5322344kB, anon-rss:517112kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1660kB oom_score_adj:999
[Tue Feb 20 19:24:38 2024] Memory cgroup out of memory: Killed process 166287 (provider-servic) total-vm:5322344kB, anon-rss:517112kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1660kB oom_score_adj:999
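
The anon-rss at each kill (~505 MiB) sits just under the 512MiB cgroup limit, consistent with the container being killed at its limit. A quick way to watch the operator's working set creep toward the limit between restarts, assuming metrics-server is installed, is:

$ kubectl -n akash-services top pod | grep inventory-operator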

Env

provider: sg.lneq

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                        IMAGE
akash-hostname-operator-6795445db-jf46g     ghcr.io/akash-network/provider:0.4.8
akash-inventory-operator-544c75d855-qs8lh   ghcr.io/akash-network/provider:0.4.8
akash-node-1-0                              ghcr.io/akash-network/node:0.30.0
akash-provider-0                            ghcr.io/akash-network/provider:0.4.8
$ kubectl get nodes -o wide
NAME    STATUS   ROLES           AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node1   Ready    control-plane   71m   v1.28.6   10.74.43.129   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node2   Ready    control-plane   71m   v1.28.6   10.74.43.133   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node3   Ready    <none>          69m   v1.28.6   10.74.43.131   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node4   Ready    <none>          69m   v1.28.6   10.8.68.129    <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph status
  cluster:
    id:     69d6af8d-3dfa-47cd-8f6e-bcbc5320987f
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 42m)
    mgr: b(active, since 39m), standbys: a
    osd: 8 osds: 8 up (since 40m), 8 in (since 40m)
 
  data:
    pools:   2 pools, 257 pgs
    objects: 7 objects, 577 KiB
    usage:   4.9 GiB used, 28 TiB / 28 TiB avail
    pgs:     257 active+clean

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         27.94470  root default                             
-5          6.98618      host node1                           
 2   nvme   3.49309          osd.2       up   1.00000  1.00000
 5   nvme   3.49309          osd.5       up   1.00000  1.00000
-9          6.98618      host node2                           
 6   nvme   3.49309          osd.6       up   1.00000  1.00000
 7   nvme   3.49309          osd.7       up   1.00000  1.00000
-3          6.98618      host node3                           
 1   nvme   3.49309          osd.1       up   1.00000  1.00000
 4   nvme   3.49309          osd.4       up   1.00000  1.00000
-7          6.98618      host node4                           
 0   nvme   3.49309          osd.0       up   1.00000  1.00000
 3   nvme   3.49309          osd.3       up   1.00000  1.00000

Observation: 3 vs 4 nodes

Interestingly, comparing this provider (sg.lneq, 4 nodes / 8 OSDs) to sg.lnlm (3 nodes / 6 OSDs), the latter doesn't experience this issue:

$ kubectl -n akash-services get pods
NAME                                        READY   STATUS    RESTARTS   AGE
akash-hostname-operator-6795445db-5xhq5     1/1     Running   0          5d7h
akash-inventory-operator-75d7758b86-gh2wj   1/1     Running   0          5d6h
akash-node-1-0                              1/1     Running   0          5d7h
akash-provider-0                            1/1     Running   0          21h

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph status
  cluster:
    id:     661a3fe0-5ff2-4575-a421-f812501f463c
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 5d)
    mgr: a(active, since 5d), standbys: b
    osd: 6 osds: 6 up (since 5d), 6 in (since 5d)
 
  data:
    pools:   2 pools, 257 pgs
    objects: 491 objects, 889 MiB
    usage:   9.8 GiB used, 5.2 TiB / 5.2 TiB avail
    pgs:     257 active+clean
 
  io:
    client:   341 B/s wr, 0 op/s rd, 0 op/s wr
 
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         5.23975  root default                             
-3         1.74658      host node1                           
 0   nvme  0.87329          osd.0       up   1.00000  1.00000
 2   nvme  0.87329          osd.2       up   1.00000  1.00000
-5         1.74658      host node2                           
 1   nvme  0.87329          osd.1       up   1.00000  1.00000
 4   nvme  0.87329          osd.4       up   1.00000  1.00000
-7         1.74658      host node3                           
 3   nvme  0.87329          osd.3       up   1.00000  1.00000
 5   nvme  0.87329          osd.5       up   1.00000  1.00000

$ kubectl get nodes -o wide
NAME    STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
node1   Ready    control-plane   5d8h   v1.28.6   192.168.0.100   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
node2   Ready    control-plane   5d8h   v1.28.6   192.168.0.101   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
node3   Ready    <none>          5d8h   v1.28.6   192.168.0.102   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11

Next steps

I've raised the memory limit to 1GiB to see if that helps.
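
For reference, the equivalent one-off change on the running deployment (the chart's resources value is the proper place for a permanent fix, since a subsequent helm upgrade would revert this; deployment name taken from the pod listing above):

$ kubectl -n akash-services set resources deployment/akash-inventory-operator --limits=memory=1Gi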
