Detect memory kills in AWS Build Jobs #46447
base: develop
Conversation
@alecbcs what's the smallest change I can make to a package to trigger a rebuild in one of the stacks?
I'd bump the version of a package which has no dependents. E.g. nano in the
This might also need to rebase on top of #46445 to get CI to pass.
@spackbot run pipeline
I've started that pipeline for you!
@spackbot run pipeline
I've started that pipeline for you!
Closes spack/spack-gantry#117

This PR is motivated by the fact that we will be implementing memory limits in CI at some point, and we want a robust and stable way of detecting if we are killing jobs due to memory constraints. There is currently no way to detect this in k8s/prometheus in our environment. For example, this job was [OOM killed](https://gitlab.spack.io/spack/spack/-/jobs/12730664), yet the information reported to prometheus/opensearch/etc. does not suggest a reason.

I came across a [blog post](https://engineering.outschool.com/posts/gitlab-runner-on-kubernetes/#out-of-memory-detection) that describes the same issue, which boils down to the fact that k8s can only detect OOM kills for pid=1. In the build containers, the gitlab runner itself is pid 1, while the script steps are spawned as separate processes. This has changed with cgroups v2, [which checks for OOM kills in all processes](https://itnext.io/kubernetes-silent-pod-killer-104e7c8054d9). However, many of our [runner containers](https://github.com/spack/gitlab-runners/tree/main/Dockerfiles) are using OS versions outside the [support matrix](https://kubernetes.io/docs/concepts/architecture/cgroups/#requirements) for this feature.

The author of the blog post I mentioned pushed [a feature](https://gitlab.com/outschool-eng/gitlab-runner/-/commit/65d5c4d468ffdbde0ceeafd9168d1326bae8e708) to his fork of gitlab-runner that checks for OOM using kernel messages after job failure. I adapted this into a call in `after_script`, which relies upon permission to run `dmesg`. The benefit of `after_script` is that it's executed regardless of the exit reason, unless the runner dies or times out. If an OOM is detected, it's output to the trace and a file is written to `jobs_scratch_dir/user_data/oom-info`, which can be accessed by a client like:

```
GET https://gitlab.spack.io/api/v4/projects/:id/jobs/:job_id/artifacts/jobs_scratch_dir/user_data/oom-info
```

I attempted to have this propagated as a pod annotation/label to no avail, and other methods of sending this to prometheus would be far too complex.

I've tested it in the staging cluster by setting artificially low limits; check out [this pipeline](https://gitlab.staging.spack.io/spack/spack/-/pipelines/1256).
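For reference, here is a minimal sketch of the kind of check described above, written in Python rather than as the actual `after_script` shell step in this PR; the `dmesg` message patterns and script layout are assumptions, while the `jobs_scratch_dir/user_data/oom-info` path follows the description:

```python
"""Illustrative sketch (not the exact code in this PR): scan `dmesg` output
for OOM-kill messages after a job finishes and record the result under
jobs_scratch_dir/user_data/ so it can be fetched later as a job artifact."""

import pathlib
import re
import subprocess
import sys

# Artifact location described in this PR.
USER_DATA_DIR = pathlib.Path("jobs_scratch_dir/user_data")
# Assumed kernel-message patterns; the real check may match differently.
OOM_PATTERN = re.compile(r"out of memory|oom-kill|killed process", re.IGNORECASE)


def detect_oom() -> bool:
    """Return True and write an oom-info artifact if the kernel log shows an OOM kill.

    Relies on permission to run `dmesg` inside the build container.
    """
    try:
        result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"could not read kernel messages: {exc}", file=sys.stderr)
        return False

    oom_lines = [line for line in result.stdout.splitlines() if OOM_PATTERN.search(line)]
    if not oom_lines:
        return False

    # Echo the evidence into the job trace...
    print("OOM kill detected during this job:")
    print("\n".join(oom_lines))
    # ...and write a small artifact a client can fetch through the GitLab jobs API.
    USER_DATA_DIR.mkdir(parents=True, exist_ok=True)
    (USER_DATA_DIR / "oom-info").write_text("\n".join(oom_lines) + "\n")
    return True


if __name__ == "__main__":
    detect_oom()
```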
For what it's worth, on a different project where we're also plagued by jobs mysteriously killed, probably due to OOM, and where observability after the fact is problematic, I recently started using earlyoom, which is more tunable and somewhat more reliable than Linux's own OOM killer, and with it you can much more easily inspect whether and why a process was killed for using too much memory.
Thanks for the tip. I could be proven wrong, but the OOM killer running inside of our build containers does a solid job of terminating processes that go over the memory limit. The setup of gitlab runners + containers definitely helps with isolating the kernel messages and processes. I have seen cases where a concretization pool with a low memory limit will hang and time out after an hour rather than getting killed, so earlyoom would definitely be useful for those scenarios.
Did some testing with earlyoom.

Pros:
Cons:
The first con is somewhat surmountable, but we would be playing a game of whack-a-mole if anything changes with the way gitlab-runner sets up child processes. The second would require additional logic to convert memory thresholds for each job.
While the initial implementation is not the "smoothest," it consistently detects OOM kills and works given the limitations outlined.

Note to self: exiting with code
Occasionally, a build job is generated, then it finds that there is an available binary in the cache. Other than checking the job trace (which can be many MB in size), we should write this info to a small file.
Add clarifying comment about the isolation of the check wrt the node and write to generic `job_metadata.txt` file
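As a rough sketch of the idea (not the code in this commit), a hypothetical helper could append small key/value entries to the generic `job_metadata.txt` file so that outcomes like "binary found in cache" can be read without downloading a multi-MB trace; the path and key names below are made up for illustration:

```python
import pathlib

# Hypothetical location, following the jobs_scratch_dir/user_data layout used
# for oom-info; the exact path and key names are assumptions for this example.
METADATA_FILE = pathlib.Path("jobs_scratch_dir/user_data/job_metadata.txt")


def record_job_metadata(key: str, value: str) -> None:
    """Append a small key=value line describing the job's outcome."""
    METADATA_FILE.parent.mkdir(parents=True, exist_ok=True)
    with METADATA_FILE.open("a") as fh:
        fh.write(f"{key}={value}\n")


# Example usage: note that the build was satisfied from the binary cache.
record_job_metadata("binary_found_in_cache", "true")
```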
@scottwittenburg @kwryankrattiger @alecbcs
@mvandenburgh you might be interested in this for error taxonomy
job links:
Prometheus query for `kube_pod_container_status_last_terminated_reason`: no `OOMKilled` result present.

Querying `container_oom_events_total` is a bit better, but it doesn't consistently work. For instance, search for the pod `runner-i5axk223-project-8-concurrent-0-9rzl3xc0` for this job (which was OOM killed), but it isn't registered as an "OOM event" for some reason.
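For completeness, a minimal sketch of running that query against the Prometheus HTTP API; the Prometheus endpoint URL is a placeholder (not our real one), and the pod name is the one mentioned above:

```python
"""Sketch: check container_oom_events_total for a given pod via the
Prometheus HTTP API (/api/v1/query)."""

import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder endpoint
POD = "runner-i5axk223-project-8-concurrent-0-9rzl3xc0"

query = f'container_oom_events_total{{pod="{POD}"}}'
url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": query})

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# An empty result list is the inconsistency described above: the job was
# OOM killed, yet no OOM event was recorded for the pod.
for sample in data["data"]["result"]:
    print(sample["metric"].get("pod"), sample["value"])
```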