worker agent on linux might be getting oomkilled when it's the job's fault instead #64

Open
iliana opened this issue Oct 14, 2024 · 3 comments

Comments

@iliana

iliana commented Oct 14, 2024

I think I'm seeing a "worker agent experienced a fatal error; aborting job" error on this run because the worker agent is getting oomkilled on Linux:

https://buildomat.eng.oxide.computer/wg/0/details/01JA5Z0YAABH97EZSWA21ZYMM7/dOyWW4nBzXdj1VMvNjVicWxjHQqHQlHElpGIaVzF4AojHD2t/01JA5Z1D0C5BGA3FKYBRVB60Q9

This is occurring after TestLint/TestErrCheck, which runs a command that completely exhausts the 32 GB of RAM on the machine I'm using to debug this. I wonder if the entire cgroup is getting axed and not just the underlying job process.

Does the agent set its oomkiller priority at all? (I think there's like three ways to do this now on Linux because of course there is.)

iliana changed the title from "worker agent on linux might be getting oomkilled" to "worker agent on linux might be getting oomkilled when it's the job's fault instead" on Oct 14, 2024
@jclulow
Collaborator

jclulow commented Oct 14, 2024

It does not, but I would be happy to make use of whatever cgroup/oomkiller APIs make sense for a control agent that should absolutely not die!

@jclulow
Collaborator

jclulow commented Oct 14, 2024

@iliana I think what I would like to have is:

  • the ability to disable the OOM killer completely for the buildomat agent, without accidentally disabling it for children we fork
  • the ability to include a crisp event in the job event stream that lists the name and PID of anything else that gets OOM killed

Is there a good API for listening to OOM kill events or am I going to have to tail a log or have a journalctl child or something 😅
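One candidate for the "listening" part, offered as a sketch rather than a confirmed answer: on cgroup v2 each cgroup has a `memory.events` file whose `oom_kill` line counts processes in that cgroup killed by the OOM killer, and the file generates a modified event when its counters change. That gives a count, not the name and PID of the victim; those still appear to live only in the kernel log. The function names below are hypothetical, and the cgroup path passed in is assumed to be the one the job runs in.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Extract the `oom_kill` counter from the text of a cgroup v2
/// `memory.events` file. Returns None if the line is absent or malformed.
fn parse_oom_kill_count(events: &str) -> Option<u64> {
    events.lines().find_map(|line| {
        let mut fields = line.split_whitespace();
        match (fields.next(), fields.next()) {
            // Match the key exactly so `oom` and `oom_group_kill` are skipped.
            (Some("oom_kill"), Some(n)) => n.parse().ok(),
            _ => None,
        }
    })
}

/// Read the counter for one cgroup directory (hypothetical helper; the
/// directory is whatever cgroup the job was placed in).
fn read_oom_kill_count(cgroup_dir: &Path) -> io::Result<Option<u64>> {
    let text = fs::read_to_string(cgroup_dir.join("memory.events"))?;
    Ok(parse_oom_kill_count(&text))
}
```

An agent could sample this before and after a job (or whenever the file changes) and emit a job event if the counter went up.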

@iliana
Author

iliana commented Oct 14, 2024

I have never looked into this beyond the small bit of log message that lives in my head where OpenSSH tells you it's setting its oom_score_adj at startup. (In the case of OpenSSH, obviously it's doing something to make sure the shells it spawns for users who are logging in do not inherit that oom_score_adj.)
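A minimal sketch of that OpenSSH-style pattern, assuming the agent runs with enough privilege to lower its own score: the parent writes to `/proc/self/oom_score_adj`, and each child resets the value between fork and exec so it does not inherit the exemption. The function names are hypothetical and this is not how buildomat does it today (per the thread, it currently does nothing here).

```rust
use std::fs;
use std::io;
use std::os::unix::process::CommandExt;
use std::process::Command;

/// Bounds the kernel accepts for oom_score_adj; -1000 exempts a process
/// from the OOM killer entirely.
const OOM_SCORE_ADJ_MIN: i32 = -1000;
const OOM_SCORE_ADJ_MAX: i32 = 1000;

/// Clamp a requested value into the accepted range.
fn clamp_adj(value: i32) -> i32 {
    value.clamp(OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX)
}

/// Set the adjustment for the calling process (the agent).
fn set_own_oom_score_adj(value: i32) -> io::Result<()> {
    fs::write("/proc/self/oom_score_adj", clamp_adj(value).to_string())
}

/// Build a command whose child process rewrites its own adjustment
/// just before exec, so it does not inherit the agent's value.
fn job_command(program: &str, child_adj: i32) -> Command {
    let mut cmd = Command::new(program);
    let adj = clamp_adj(child_adj).to_string();
    // SAFETY: the closure runs in the forked child before exec, where
    // /proc/self refers to the child. It only performs a small file write;
    // a production version should audit this for async-signal-safety.
    unsafe {
        cmd.pre_exec(move || fs::write("/proc/self/oom_score_adj", &adj));
    }
    cmd
}
```

The agent might call `set_own_oom_score_adj(-1000)` once at startup and then launch every job through `job_command(..., 0)`. Writing a negative value generally requires `CAP_SYS_RESOURCE`, so the first call can fail for an unprivileged agent.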

I assume the "right" way to do this would be for the agent to create a new cgroup and run the program inside the cgroup, configuring it to be the first to go when the RAM runs out. What is actually done beyond that point to understand why the process was killed is not something I'm immediately aware of.
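A rough sketch of the cgroup idea, assuming cgroup v2, a parent cgroup that has been delegated to the agent, and the memory controller enabled for its children. It creates a child cgroup, caps it with `memory.max` so the job hits its own limit before the whole machine runs dry, sets `memory.oom.group` so the job's processes are killed together, and moves a PID in through `cgroup.procs`. The function names and the choice of limit are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Create a child cgroup under `parent` and configure its memory policy.
/// `parent` is assumed to be a delegated cgroup v2 directory.
fn prepare_job_cgroup(
    parent: &Path,
    name: &str,
    memory_max_bytes: u64,
) -> io::Result<PathBuf> {
    let dir = parent.join(name);
    fs::create_dir_all(&dir)?;
    // Hard memory limit for everything inside the job cgroup.
    fs::write(dir.join("memory.max"), memory_max_bytes.to_string())?;
    // On OOM, kill the cgroup's processes as one unit rather than one victim.
    fs::write(dir.join("memory.oom.group"), "1")?;
    Ok(dir)
}

/// Move a process (for example, a freshly spawned job) into the cgroup.
fn move_into_cgroup(cgroup: &Path, pid: u32) -> io::Result<()> {
    fs::write(cgroup.join("cgroup.procs"), pid.to_string())
}
```

Because the agent stays outside the job cgroup, an OOM triggered by the job's own limit would be confined to the job's processes. A machine-wide OOM is a separate case, which is where `oom_score_adj` still matters.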

There's also systemd-oomd. I'm not sure how recent it is (which matters for the Ubuntu images older than 24.04), but it apparently exists because the Linux kernel's oomkiller logic leaves a lot to be desired and can't really take anything into account dynamically beyond the one knob of oom_score_adj; one knob does not a policy make.
