Detect memory kills in AWS Build Jobs #46447

Open · wants to merge 3 commits into base: develop

Conversation

cmelone
Contributor

@cmelone cmelone commented Sep 17, 2024

@scottwittenburg @kwryankrattiger @alecbcs

@mvandenburgh you might be interested in this for error taxonomy

Closes spack/spack-gantry#117

This PR is motivated by the fact that we will be implementing memory limits in CI at some point, and we want a robust and stable way of detecting whether we are killing jobs due to memory constraints.

There is currently no way to detect this in k8s/prometheus in our environment.

For example, this job was [OOM killed](https://gitlab.spack.io/spack/spack/-/jobs/12730664), yet the information reported to prometheus/opensearch/etc. does not suggest a reason.

I came across a [blog post](https://engineering.outschool.com/posts/gitlab-runner-on-kubernetes/#out-of-memory-detection) that describes the same issue, which boils down to the fact that k8s can only detect OOM kills for pid=1. In the build containers, the gitlab runner itself is pid 1, while the script steps are spawned as separate processes.

This is something that has changed with cgroups v2, [which checks for OOM kills in all processes](https://itnext.io/kubernetes-silent-pod-killer-104e7c8054d9). However, many of our [runner containers](https://github.com/spack/gitlab-runners/tree/main/Dockerfiles) are using OS versions outside the [support matrix](https://kubernetes.io/docs/concepts/architecture/cgroups/#requirements) for this feature.

The author of the blog post I mentioned pushed [a feature](https://gitlab.com/outschool-eng/gitlab-runner/-/commit/65d5c4d468ffdbde0ceeafd9168d1326bae8e708) to his fork of gitlab-runner that checks for OOM kills using kernel messages after a job failure.

I adapted this to a call in `after_script`, which relies upon permission to run `dmesg`.

The benefit of `after_script` is that it is executed regardless of the exit reason, unless the runner dies or times out.
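Roughly, the check amounts to something like the following sketch (the exact commands and grep pattern in this PR may differ; the output path matches the one described below):

```sh
# Sketch of an after_script OOM check; requires permission to read the kernel ring buffer (dmesg).
mkdir -p jobs_scratch_dir/user_data
if dmesg 2>/dev/null | grep -iE "out of memory|oom-kill|killed process" > /tmp/oom-messages; then
    # Surface the finding in the job trace and persist it as a small artifact.
    {
        echo "Likely OOM kill detected during this job; matching kernel messages:"
        cat /tmp/oom-messages
    } | tee jobs_scratch_dir/user_data/oom-info
fi
```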

If an OOM is detected, it's output to the trace and a file is written to `jobs_scratch_dir/user_data/oom-info`, which can be accessed by a client like:

```
GET https://gitlab.spack.io/api/v4/projects/:id/jobs/:job_id/artifacts/jobs_scratch_dir/user_data/oom-info
```
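For example, with curl (a sketch: `:id`, `:job_id`, and the token are placeholders, and a token is only needed if the artifacts aren't publicly readable):

```sh
# Fetch the oom-info artifact for a given job through the GitLab job artifacts API.
curl --fail -H "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.spack.io/api/v4/projects/:id/jobs/:job_id/artifacts/jobs_scratch_dir/user_data/oom-info"
```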

I attempted to propagate this as a pod annotation/label, to no avail, and other methods of sending it to prometheus would be far too complex.

I've tested it in the staging cluster by setting artificially low limits; check out [this pipeline](https://gitlab.staging.spack.io/spack/spack/-/pipelines/1256).

Job links:

- Prometheus query for `kube_pod_container_status_last_terminated_reason`: no `OOMKilled` result is present.
- Querying `container_oom_events_total` is a bit better, but it doesn't work consistently. For instance, searching for the pod `runner-i5axk223-project-8-concurrent-0-9rzl3xc0` for this job (which was OOM killed) shows that it isn't registered as an "OOM event" for some reason.
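For reference, the same two checks can be reproduced against the Prometheus HTTP API; a sketch, with the Prometheus host as a placeholder:

```sh
# Instant queries for the two metrics discussed above via the Prometheus HTTP API.
PROM="https://prometheus.example.invalid"   # placeholder host

curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}'

curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=container_oom_events_total > 0'
```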

@spackbot-app bot added the core (PR affects Spack core functionality), gitlab (Issues related to gitlab integration), and update-package labels on Sep 17, 2024
@cmelone
Contributor Author

cmelone commented Sep 17, 2024

@alecbcs what's the smallest change I can make to a package to trigger a rebuild in one of the stacks?

@alecbcs
Member

alecbcs commented Sep 17, 2024

I'd bump the version of a package that has no dependents, e.g. nano in the developer-tools stack.

```console
> spack checksum nano 8.2
==> Found 1 version of nano
==> Fetching https://www.nano-editor.org/dist/v8/nano-8.2.tar.xz

    version("8.2", sha256="d5ad07dd862facae03051c54c6535e54c7ed7407318783fcad1ad2d7076fffeb")
```

@alecbcs
Member

alecbcs commented Sep 17, 2024

This might also need to be rebased on top of #46445 to get CI to pass.

@cmelone
Contributor Author

cmelone commented Sep 18, 2024

@spackbot run pipeline


spackbot-app bot commented Sep 18, 2024

I've started that pipeline for you!

@cmelone
Contributor Author

cmelone commented Sep 18, 2024

@spackbot run pipeline


spackbot-app bot commented Sep 18, 2024

I've started that pipeline for you!

@cmelone cmelone changed the title from "test CI OOM behavior" to "Detect memory kills in AWS Build Jobs" on Oct 24, 2024
@giordano
Member

For what it's worth, in a different project where we're also plagued by jobs mysteriously killed, probably due to OOM, and where observability after the fact is problematic, I recently started using earlyoom, which is more tunable and somewhat more reliable than Linux's own OOM killer, and it makes it much easier to inspect whether and why a process was killed for using too much memory.

@cmelone
Contributor Author

cmelone commented Oct 24, 2024

Thanks for the tip. I could be proven wrong, but the OOM killer running inside our build containers does a solid job of terminating processes that exceed `KUBERNETES_MEMORY_LIMIT`.

The setup of gitlab runners + containers definitely helps with isolating the kernel messages and processes.

I have seen cases where a concretization pool with a low memory limit will hang and time out after an hour rather than getting killed, so earlyoom would definitely be useful for those scenarios.

@cmelone cmelone marked this pull request as draft October 25, 2024 18:29
@cmelone
Contributor Author

cmelone commented Oct 28, 2024

Did some testing with earlyoom following feedback on Friday.

Pros:

- sends a proper termination signal (picked up by prometheus) if a process is killed during the script step

Cons:

- if memory contention occurs during the `before_script`, it kills the main gitlab-runner process, leading to a hang (example). This OOM does not get picked up by prometheus.
- it gets memory info from `/proc/meminfo`, meaning it isn't cgroup aware, so we would need to configure it with `KUBERNETES_MEMORY_LIMIT`

The first con is somewhat surmountable, but we would be playing whack-a-mole if anything changes in the way gitlab-runner sets up child processes. The second would require additional logic to convert the memory threshold for each job.
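To illustrate what that extra logic would look like, here is a rough sketch, assuming earlyoom's `-M` option (minimum available memory in KiB) and a `KUBERNETES_MEMORY_LIMIT` in Kubernetes quantity notation; this is not what the PR does, just the kind of glue earlyoom would need:

```sh
#!/bin/sh
# Illustrative only: derive an earlyoom threshold from the per-job memory limit.
# earlyoom watches /proc/meminfo (host-wide), so the best we can do is ask it to act
# once node-wide available memory drops below "total minus this job's allowance",
# which naively assumes the job is the only significant consumer on the node.

limit="${KUBERNETES_MEMORY_LIMIT:-8Gi}"

# Kubernetes quantity -> KiB
case "$limit" in
    *Gi) limit_kib=$(( ${limit%Gi} * 1024 * 1024 )) ;;
    *Mi) limit_kib=$(( ${limit%Mi} * 1024 )) ;;
    *Ki) limit_kib=${limit%Ki} ;;
    *)   limit_kib=$(( limit / 1024 )) ;;  # assume plain bytes
esac

total_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
threshold_kib=$(( total_kib - limit_kib ))

exec earlyoom -M "$threshold_kib"
```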

nohang and oomd are alternatives, but they require cgroups v2, which many of our images do not support.

While the initial implementation is not the "smoothest," it consistently detects OOM kills and works given the limitations outlined.


note to self: exiting with code 137 does not get picked up in prometheus (example)
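(For context, 137 is just the conventional 128 + SIGKILL(9) exit status reported by a shell, so a process can exit with it without any container-level kill event that prometheus would see:)

```sh
# 137 = 128 + 9 (SIGKILL): the status a shell reports for a SIGKILL'd process.
bash -c 'kill -KILL $$'; echo "exit status: $?"   # prints: exit status: 137
```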

@cmelone cmelone marked this pull request as ready for review October 28, 2024 15:20
@cmelone cmelone closed this Oct 29, 2024
@cmelone cmelone reopened this Oct 29, 2024
@cmelone cmelone marked this pull request as draft October 29, 2024 15:37
@cmelone cmelone marked this pull request as ready for review October 29, 2024 15:37
Occasionally, a build job is generated and then finds that a binary is already available in the cache. Other than checking the job trace (which can be many MB in size), we should write this info to a small file.
Add a clarifying comment about the isolation of the check with respect to the node, and write to a generic `job_metadata.txt` file.
Labels: commands, core (PR affects Spack core functionality), gitlab (Issues related to gitlab integration), new-version, shell-support, update-package