Detect memory kills in AWS Build Jobs #46447

Open · wants to merge 3 commits into base: develop

Conversation

cmelone
Contributor

@cmelone cmelone commented Sep 17, 2024

@scottwittenburg @kwryankrattiger @alecbcs

@mvandenburgh you might be interested in this for error taxonomy

Closes spack/spack-gantry#117

This PR is motivated by the fact that we will be implementing memory limits in CI at some point, and we want a robust and stable way of detecting whether we are killing jobs due to memory constraints.

There is currently no way to detect this in k8s/prometheus in our environment.

For example, this job was [OOM killed](https://gitlab.spack.io/spack/spack/-/jobs/12730664), yet the information reported to prometheus/opensearch/etc. does not suggest a reason.

I came across a [blog post](https://engineering.outschool.com/posts/gitlab-runner-on-kubernetes/#out-of-memory-detection) that describes the same issue, which boils down to the fact that k8s can only detect OOM kills for pid=1. In the build containers, the gitlab runner itself is pid 1, while the script steps are spawned as separate processes.

This is something that has changed with cgroups v2, [which checks for OOM kills in all processes](https://itnext.io/kubernetes-silent-pod-killer-104e7c8054d9). However, many of our [runner containers](https://github.com/spack/gitlab-runners/tree/main/Dockerfiles) are using OS versions outside the [support matrix](https://kubernetes.io/docs/concepts/architecture/cgroups/#requirements) for this feature.

The author of the blog post I mentioned pushed [a feature](https://gitlab.com/outschool-eng/gitlab-runner/-/commit/65d5c4d468ffdbde0ceeafd9168d1326bae8e708) to his fork of gitlab-runner that checks for OOM kills using kernel messages after a job failure.

I adapted this to a call in `after_script`, which relies upon permission to run `dmesg`.

The benefit of `after_script` is that it is executed regardless of the exit reason, unless the runner dies or times out.
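Roughly, the check amounts to something like the following sketch (the exact commands and grep pattern in this PR may differ; the output path matches the one described below):

```sh
# Sketch of an after_script OOM check; requires permission to read the kernel ring buffer (dmesg).
mkdir -p jobs_scratch_dir/user_data
if dmesg 2>/dev/null | grep -iE "out of memory|oom-kill|killed process" > /tmp/oom-messages; then
    # Surface the finding in the job trace and persist it as a small artifact.
    {
        echo "Likely OOM kill detected during this job; matching kernel messages:"
        cat /tmp/oom-messages
    } | tee jobs_scratch_dir/user_data/oom-info
fi
```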

If an OOM is detected, it's output to the trace and a file is written to `jobs_scratch_dir/user_data/oom-info`, which can be accessed by a client like:

```
GET https://gitlab.spack.io/api/v4/projects/:id/jobs/:job_id/artifacts/jobs_scratch_dir/user_data/oom-info
```
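For example, with curl (a sketch: `:id`, `:job_id`, and the token are placeholders, and a token is only needed if the artifacts aren't publicly readable):

```sh
# Fetch the oom-info artifact for a given job through the GitLab job artifacts API.
curl --fail -H "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.spack.io/api/v4/projects/:id/jobs/:job_id/artifacts/jobs_scratch_dir/user_data/oom-info"
```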

I attempted to propagate this as a pod annotation/label, to no avail, and other methods of sending it to prometheus would be far too complex.

I've tested it in the staging cluster by setting artificially low limits; check out [this pipeline](https://gitlab.staging.spack.io/spack/spack/-/pipelines/1256).

Job links:

- Prometheus query for `kube_pod_container_status_last_terminated_reason`: no `OOMKilled` result is present.
- Querying `container_oom_events_total` is a bit better, but it doesn't work consistently. For instance, searching for the pod `runner-i5axk223-project-8-concurrent-0-9rzl3xc0` for this job (which was OOM killed) shows that it isn't registered as an "OOM event" for some reason.
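For reference, the same two checks can be reproduced against the Prometheus HTTP API; a sketch, with the Prometheus host as a placeholder:

```sh
# Instant queries for the two metrics discussed above via the Prometheus HTTP API.
PROM="https://prometheus.example.invalid"   # placeholder host

curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}'

curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=container_oom_events_total > 0'
```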

@spackbot-app bot added the core (PR affects Spack core functionality), gitlab (Issues related to gitlab integration), and update-package labels on Sep 17, 2024
@cmelone
Contributor Author

cmelone commented Sep 17, 2024

@alecbcs what's the smallest change I can make to a package to trigger a rebuild in one of the stacks?

@alecbcs
Member

alecbcs commented Sep 17, 2024

I'd bump the version of a package that has no dependents, e.g. nano in the developer-tools stack.

```console
> spack checksum nano 8.2
==> Found 1 version of nano
==> Fetching https://www.nano-editor.org/dist/v8/nano-8.2.tar.xz

    version("8.2", sha256="d5ad07dd862facae03051c54c6535e54c7ed7407318783fcad1ad2d7076fffeb")
```

@alecbcs
Member

alecbcs commented Sep 17, 2024

This might also need to be rebased on top of #46445 to get CI to pass.

@cmelone
Contributor Author

cmelone commented Sep 18, 2024

@spackbot run pipeline


spackbot-app bot commented Sep 18, 2024

I've started that pipeline for you!

@cmelone
Contributor Author

cmelone commented Sep 18, 2024

@spackbot run pipeline


spackbot-app bot commented Sep 18, 2024

I've started that pipeline for you!

@cmelone cmelone changed the title from "test CI OOM behavior" to "Detect memory kills in AWS Build Jobs" on Oct 24, 2024
@giordano
Member

For what it's worth, in a different project where we're also plagued by jobs mysteriously killed, probably due to OOM, and where observability after the fact is problematic, I recently started using earlyoom, which is more tunable and somewhat more reliable than Linux's own OOM killer, and it makes it much easier to inspect whether and why a process was killed for using too much memory.

@cmelone
Contributor Author

cmelone commented Oct 24, 2024

Thanks for the tip. I could be proven wrong, but the OOM killer running inside our build containers does a solid job of terminating processes that exceed `KUBERNETES_MEMORY_LIMIT`.

The setup of gitlab runners + containers definitely helps with isolating the kernel messages and processes.

I have seen cases where a concretization pool with a low memory limit will hang and time out after an hour rather than getting killed, so earlyoom would definitely be useful for those scenarios.

@cmelone cmelone marked this pull request as draft October 25, 2024 18:29
@cmelone
Contributor Author

cmelone commented Oct 28, 2024

Did some testing with earlyoom following feedback on Friday.

Pros:

- sends a proper termination signal (picked up by prometheus) if a process is killed during the script step

Cons:

- if memory contention occurs during the `before_script`, it kills the main gitlab-runner process, leading to a hang (example). This OOM does not get picked up by prometheus.
- it gets memory info from `/proc/meminfo`, meaning it isn't cgroup aware, so we would need to configure it with `KUBERNETES_MEMORY_LIMIT`

The first con is somewhat surmountable, but we would be playing whack-a-mole if anything changes in the way gitlab-runner sets up child processes. The second would require additional logic to convert the memory threshold for each job.
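To illustrate what that extra logic would look like, here is a rough sketch, assuming earlyoom's `-M` option (minimum available memory in KiB) and a `KUBERNETES_MEMORY_LIMIT` in Kubernetes quantity notation; this is not what the PR does, just the kind of glue earlyoom would need:

```sh
#!/bin/sh
# Illustrative only: derive an earlyoom threshold from the per-job memory limit.
# earlyoom watches /proc/meminfo (host-wide), so the best we can do is ask it to act
# once node-wide available memory drops below "total minus this job's allowance",
# which naively assumes the job is the only significant consumer on the node.

limit="${KUBERNETES_MEMORY_LIMIT:-8Gi}"

# Kubernetes quantity -> KiB
case "$limit" in
    *Gi) limit_kib=$(( ${limit%Gi} * 1024 * 1024 )) ;;
    *Mi) limit_kib=$(( ${limit%Mi} * 1024 )) ;;
    *Ki) limit_kib=${limit%Ki} ;;
    *)   limit_kib=$(( limit / 1024 )) ;;  # assume plain bytes
esac

total_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
threshold_kib=$(( total_kib - limit_kib ))

exec earlyoom -M "$threshold_kib"
```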

nohang and oomd are alternatives, but they require cgroups v2, which many of our images do not support.

While the initial implementation is not the "smoothest," it consistently detects OOM kills and works given the limitations outlined.


note to self: exiting with code 137 does not get picked up in prometheus (example)
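(For context, 137 is just the conventional 128 + SIGKILL(9) exit status reported by a shell, so a process can exit with it without any container-level kill event that prometheus would see:)

```sh
# 137 = 128 + 9 (SIGKILL): the status a shell reports for a SIGKILL'd process.
bash -c 'kill -KILL $$'; echo "exit status: $?"   # prints: exit status: 137
```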

@cmelone cmelone marked this pull request as ready for review October 28, 2024 15:20
@cmelone cmelone closed this Oct 29, 2024
@cmelone cmelone reopened this Oct 29, 2024
@cmelone cmelone marked this pull request as draft October 29, 2024 15:37
@cmelone cmelone marked this pull request as ready for review October 29, 2024 15:37
Occasionally, a build job is generated and then finds that a binary is already available in the cache. Other than checking the job trace (which can be many MB in size), we should write this info to a small file.
Add a clarifying comment about the isolation of the check with respect to the node, and write to a generic `job_metadata.txt` file.
Labels: commands, core (PR affects Spack core functionality), gitlab (Issues related to gitlab integration), new-version, shell-support, update-package