Detect memory kills in AWS Build Jobs
Closes spack/spack-gantry#117

This PR is motivated by the fact that we will be implementing memory limits in CI at some point, and we want a robust and stable way of detecting whether jobs are being killed due to memory constraints.

There is currently no way to detect this in k8s/prometheus in our environment.

For example, this job was [OOM killed](https://gitlab.spack.io/spack/spack/-/jobs/12730664), yet the information reported to prometheus/opensearch/etc does not suggest a reason.

I came across a [blog post](https://engineering.outschool.com/posts/gitlab-runner-on-kubernetes/#out-of-memory-detection) that describes the same issue, which boils down to the fact that k8s can only detect OOM kills for pid 1. In the build containers, the gitlab runner itself is pid 1, while the script steps are spawned as separate processes.
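
For context, a quick way to see this from inside a build container (purely illustrative, not part of this change):

```
# pid 1 in the build container is the runner's shell/helper process,
# not the build step that the kernel's OOM killer actually targets
cat /proc/1/comm
```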

This is something that has changed with cgroups v2, [which checks for OOM kills in all processes](https://itnext.io/kubernetes-silent-pod-killer-104e7c8054d9). However, many of our [runner containers](https://github.com/spack/gitlab-runners/tree/main/Dockerfiles) are using OS versions outside the [support matrix](https://kubernetes.io/docs/concepts/architecture/cgroups/#requirements) for this feature.
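
As an aside, a quick way to check which cgroup version a host is running (again, just a sketch, not part of this change):

```
# prints "cgroup2fs" on cgroups v2 hosts and "tmpfs" on cgroups v1
stat -fc %T /sys/fs/cgroup
```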

The author of the blog post I mentioned pushed [a feature](https://gitlab.com/outschool-eng/gitlab-runner/-/commit/65d5c4d468ffdbde0ceeafd9168d1326bae8e708) to his fork of gitlab runner that checks for OOM using kernel messages after job failure.

I adapted this to a call in `after_script`, which relies upon permission to run `dmesg`.

The benefit of `after_script` is that it's executed regardless of exit reason, unless the runner dies or times out.

If an OOM kill is detected, a message is printed to the job trace and a file is written to `jobs_scratch_dir/user_data/oom-info`, which can be accessed by a client like:

```
GET https://gitlab.spack.io/api/v4/projects/:id/jobs/:job_id/artifacts/jobs_scratch_dir/user_data/oom-info
```
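
For example, with `curl` (the token below is a placeholder; authentication is only needed for private projects):

```
curl --header "PRIVATE-TOKEN: <your_access_token>" \
  "https://gitlab.spack.io/api/v4/projects/:id/jobs/:job_id/artifacts/jobs_scratch_dir/user_data/oom-info"
```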

I attempted to have this propagated as a pod annotation/label, to no avail, and other methods of sending this to prometheus would be far too complex.

I've tested it in the staging cluster by setting artificially low limits; check out [this pipeline](https://gitlab.staging.spack.io/spack/spack/-/pipelines/1256).
cmelone committed Oct 24, 2024
1 parent f09ce00 commit 2e8aaa8
Showing 2 changed files with 48 additions and 0 deletions.
2 changes: 2 additions & 0 deletions share/spack/gitlab/cloud_pipelines/configs/ci.yaml
@@ -32,6 +32,8 @@ ci:
--prefix /home/software/spack:${CI_PROJECT_DIR}/opt/spack
--log install_times.json
${SPACK_ARTIFACTS_ROOT}/user_data/install_times.json || true
- - . ${CI_PROJECT_DIR}/share/spack/gitlab/cloud_pipelines/scripts/common/oom-check.sh || true

variables:
CI_JOB_SIZE: "default"
CI_GPG_KEY_ROOT: /mnt/key
46 changes: 46 additions & 0 deletions share/spack/gitlab/cloud_pipelines/scripts/common/oom-check.sh
@@ -0,0 +1,46 @@
#!/bin/bash

# Copyright 2013-2024 Lawrence Livermore National Security, LLC and other
# Spack Project Developers. See the top-level COPYRIGHT file for details.
#
# SPDX-License-Identifier: (Apache-2.0 OR MIT)

set -e

# this script was designed after this commit to a gitlab-runner fork (MIT)
# https://gitlab.com/outschool-eng/gitlab-runner/-/commit/65d5c4d468ffdbde0ceeafd9168d1326bae8e708
# we rely upon the ability to view kernel messages in the build container

SPACK_ARTIFACTS_ROOT="${CI_PROJECT_DIR}/jobs_scratch_dir"
mkdir -p "${SPACK_ARTIFACTS_ROOT}/user_data"

# exit early if job was successful or not on AWS
[[ "$CI_JOB_STATUS" != "failed" ]] && exit 0
[[ "$CI_RUNNER_TAGS" != *"aws"* ]] && exit 0

# ensure /proc/1/cgroup exists
if [[ ! -f /proc/1/cgroup ]]; then
echo "Error: /proc/1/cgroup not found"
exit 1
fi

OOM_MESSAGE="This job was killed due to memory constraints. Report to #ci-and-pipelines in Slack if you need help."

# Look for an OOM kill in the build container's cgroup, or in any other container whose cgroup path contains the pod-level memory cgroup
# the dmesg line will look like:
# [ 1578.430541] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=657a5777a8dbad52481bde927e9464ce5a838ad75f14ddf4322a32104786bce2,mems_allowed=0,oom_memcg=/kubepods/burstable/pod53bff6f9-f52d-418b-abf1-b5df128eb9cd/657a5777a8dbad52481bde927e9464ce5a838ad75f14ddf4322a32104786bce2,task_memcg=/kubepods/burstable/pod53bff6f9-f52d-418b-abf1-b5df128eb9cd/657a5777a8dbad52481bde927e9464ce5a838ad75f14ddf4322a32104786bce2,task=sh,pid=30361,uid=0
# where the last chunk of cgroup would be 657a5777a8dbad52481bde927e9464ce5a838ad75f14ddf4322a32104786bce2
# and the pod level memory config dir would be pod53bff6f9-f52d-418b-abf1-b5df128eb9cd, sourced from the second to last chunk

proc1_cgroup=$(cat /proc/1/cgroup)
ctr_cgroup=$(echo "$proc1_cgroup" | tr / '\n' | tail -1 | tr -d '[:space:]')
pod_cgroup=$(echo "$proc1_cgroup" | tr / '\n' | tail -2 | head -1 | tr -d '[:space:]')
dmesg_out=$(dmesg)

if echo "$dmesg_out" | grep -q "oom-kill.*$ctr_cgroup"; then
    echo "$OOM_MESSAGE"
    echo "OOM info: container" > "${SPACK_ARTIFACTS_ROOT}/user_data/oom-info"
elif echo "$dmesg_out" | grep -q "oom-kill.*$pod_cgroup"; then
    echo "$OOM_MESSAGE"
    echo "OOM info: pod" > "${SPACK_ARTIFACTS_ROOT}/user_data/oom-info"
fi
