worker agent on linux might be getting oomkilled when it's the job's fault instead #64
Comments
It does not, but I would be happy to make use of whatever cgroup/oomkiller APIs make sense for a control agent that should absolutely not die!
@iliana I think what I would like to have is:
Is there a good API for listening to OOM kill events or am I going to have to tail a log or have a journalctl child or something 😅
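One option that avoids tailing logs, assuming the job runs in its own cgroup v2 group: the kernel exposes a `memory.events` file per cgroup with an `oom_kill` counter, and a change to that file generates a file-modified event that can be watched with poll or inotify. The sketch below only shows the parsing and comparison step. The function names are made up for illustration and nothing here is taken from the agent's code.

```python
from typing import Dict


def parse_memory_events(text: str) -> Dict[str, int]:
    """Parse the flat "key value" lines of a cgroup v2 memory.events file."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        key, value = parts
        counters[key] = int(value)
    return counters


def oom_kills_since(before: str, after: str) -> int:
    """Number of OOM kills recorded between two snapshots of memory.events."""
    old = parse_memory_events(before).get("oom_kill", 0)
    new = parse_memory_events(after).get("oom_kill", 0)
    return new - old


# Example snapshots, as they might look before and after a job blows its limit.
snapshot_before = "low 0\nhigh 0\nmax 3\noom 0\noom_kill 0\n"
snapshot_after = "low 0\nhigh 0\nmax 7\noom 1\noom_kill 1\n"

if oom_kills_since(snapshot_before, snapshot_after) > 0:
    print("job cgroup recorded an OOM kill")
```

In practice the two snapshots would come from reading the job cgroup's `memory.events` at job start and after the job process exits, which would let the agent report "job was OOM killed" rather than a generic failure.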
I have never looked into this beyond the small bit of log message that lives in my head where OpenSSH tells you it's setting its oom_score_adj at startup. (In the case of OpenSSH, obviously it's doing something to make sure the shells it spawns as users that are logging in are not inheriting that oom_score_adj.) I assume the "right" way to do this would be for the agent to create a new cgroup and run the program inside the cgroup, configuring it to be the first to go when the RAM runs out. What is actually done beyond that point to understand why the process was killed is not something I'm immediately aware of. There's also systemd-oomd, I'm not sure how recent it is (relevant for the Ubuntu images older than 24.04), but it apparently exists because the Linux kernel's oomkiller logic leaves a lot to be desired and can't really dynamically take things into account beyond the one knob of oom_score_adj; one knob does not a policy make.
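For reference, the oom_score_adj knob itself is just a per-process file under /proc that accepts an integer from -1000 (never pick this process) to 1000 (pick this one first). A rough sketch of writing it follows. The helper name and the `proc_root` parameter are made up for illustration (the latter only so the function can be exercised against a scratch directory), and this is not how the agent does anything today. Note that lowering the value needs CAP_SYS_RESOURCE, while raising it does not.

```python
import os

# Range accepted by the kernel for /proc/<pid>/oom_score_adj.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def set_oom_score_adj(pid, score, proc_root="/proc"):
    """Write `score` to <proc_root>/<pid>/oom_score_adj and return the path.

    `proc_root` is a hypothetical parameter added so the helper can be
    pointed at a scratch directory instead of the real /proc.
    """
    if not OOM_SCORE_ADJ_MIN <= score <= OOM_SCORE_ADJ_MAX:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as f:
        f.write(str(score))
    return path
```

An agent could, for example, call `set_oom_score_adj("self", -1000)` for itself at startup (with sufficient privilege) and write `1000` for the job process after spawning it, so the job is the preferred victim. Because the value is inherited by children, the job would need its value reset explicitly, which is presumably what OpenSSH is doing for the shells it spawns.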
I think I'm seeing a "worker agent experienced a fatal error; aborting job" error on this run because the worker agent is getting oomkilled on Linux: https://buildomat.eng.oxide.computer/wg/0/details/01JA5Z0YAABH97EZSWA21ZYMM7/dOyWW4nBzXdj1VMvNjVicWxjHQqHQlHElpGIaVzF4AojHD2t/01JA5Z1D0C5BGA3FKYBRVB60Q9
This is occurring after TestLint/TestErrCheck, which runs a command that completely exhausts the 32 GB of RAM on a machine I'm debugging this on. I wonder if the entire cgroup is getting axed and not just the underlying job process.
Does the agent set its oomkiller priority at all? (I think there's like three ways to do this now on Linux because of course there is.)