worker agent on linux might be getting oomkilled when it's the job's fault instead #64

iliana · 2024-10-14T17:46:28Z

I think I'm seeing a "worker agent experienced a fatal error; aborting job" error on this run because the worker agent is getting oomkilled on Linux:

https://buildomat.eng.oxide.computer/wg/0/details/01JA5Z0YAABH97EZSWA21ZYMM7/dOyWW4nBzXdj1VMvNjVicWxjHQqHQlHElpGIaVzF4AojHD2t/01JA5Z1D0C5BGA3FKYBRVB60Q9

This is occurring after TestLint/TestErrCheck, which runs a command that completely exhausts the 32 GB of RAM on a machine I'm debugging this on. I wonder if the entire cgroup is getting axed and not just the underlying job process.

Does the agent set its oomkiller priority at all? (I think there's like three ways to do this now on Linux because of course there is.)


jclulow · 2024-10-14T18:14:30Z

It does not, but I would be happy to make use of whatever cgroup/oomkiller APIs make sense for a control agent that should absolutely not die!

jclulow · 2024-10-14T18:28:07Z

@iliana I think what I would like to have is:

- the ability to disable the OOM killer completely for the buildomat agent, without accidentally disabling it for children we fork
- the ability to include a crisp event in the job event stream that lists the name and PID of anything else that gets OOM killed

Is there a good API for listening to OOM kill events, or am I going to have to tail a log or have a journalctl child or something? 😅
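One candidate for the "listening" part, assuming the job runs in its own cgroup v2 group: the kernel keeps an `oom_kill` counter in that group's `memory.events` file, and the file generates a file-modified event when it changes, so it can be polled or watched. The sketch below only parses two snapshots of the file and reports the difference; the function names are hypothetical, and the counter does not carry the name or PID of the victim, so that detail would still have to come from somewhere else (for example the kernel log).

```rust
use std::collections::HashMap;

/// Parse the flat "key value" lines of a cgroup v2 `memory.events` file.
/// Lines that do not look like "<key> <u64>" are skipped.
fn parse_memory_events(contents: &str) -> HashMap<String, u64> {
    contents
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let key = parts.next()?;
            let value = parts.next()?.parse().ok()?;
            Some((key.to_string(), value))
        })
        .collect()
}

/// Number of OOM kills that happened between two snapshots of `memory.events`.
/// A missing `oom_kill` key is treated as zero.
fn new_oom_kills(before: &str, after: &str) -> u64 {
    let count = |s: &str| parse_memory_events(s).get("oom_kill").copied().unwrap_or(0);
    count(after).saturating_sub(count(before))
}
```

In use, the agent would read the job cgroup's `memory.events` before starting the job and again whenever the file changes, and emit a job event whenever `new_oom_kills` returns a non-zero value.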

iliana · 2024-10-14T18:45:21Z

I have never looked into this beyond the small bit of log message that lives in my head where OpenSSH tells you it's setting its oom_score_adj at startup. (In the case of OpenSSH, obviously it's doing something to make sure the shells it spawns as users that are logging in are not inheriting that oom_score_adj.)
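A minimal sketch of that pattern, assuming the agent has the privilege needed to lower its own score (lowering generally requires `CAP_SYS_RESOURCE`; raising it does not): the agent writes `-1000` to `/proc/self/oom_score_adj`, and each child writes `0` back between fork and exec so it does not inherit the exemption. The helper names are hypothetical, and this is not a description of what OpenSSH or the buildomat agent actually does.

```rust
use std::fs;
use std::io;
use std::os::unix::process::CommandExt;
use std::path::Path;
use std::process::Command;

const PROC_SELF_OOM_SCORE_ADJ: &str = "/proc/self/oom_score_adj";
/// Range accepted by the kernel; -1000 exempts the process from the OOM killer.
const OOM_SCORE_ADJ_MIN: i32 = -1000;
const OOM_SCORE_ADJ_MAX: i32 = 1000;

/// Write an OOM score adjustment to `path`, rejecting out-of-range values.
/// The path is a parameter so the helper can be exercised against a plain file.
fn write_oom_score_adj(path: &Path, value: i32) -> io::Result<()> {
    if !(OOM_SCORE_ADJ_MIN..=OOM_SCORE_ADJ_MAX).contains(&value) {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "oom_score_adj must be between -1000 and 1000",
        ));
    }
    fs::write(path, value.to_string())
}

/// Exempt the calling process (the agent) from the OOM killer.
fn protect_agent() -> io::Result<()> {
    write_oom_score_adj(Path::new(PROC_SELF_OOM_SCORE_ADJ), OOM_SCORE_ADJ_MIN)
}

/// Build a command whose child resets its own score to 0 before exec.
/// Caveat: the closure runs after fork in the child and allocates, which is
/// not strictly async-signal-safe in a multi-threaded parent.
fn job_command(program: &str) -> Command {
    let mut cmd = Command::new(program);
    unsafe {
        cmd.pre_exec(|| write_oom_score_adj(Path::new(PROC_SELF_OOM_SCORE_ADJ), 0));
    }
    cmd
}
```

The agent would call `protect_agent()` once at startup and spawn jobs through `job_command(...)`; `oom_score_adj` is inherited across fork, which is why the reset in the child is needed.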

I assume the "right" way to do this would be for the agent to create a new cgroup and run the program inside the cgroup, configuring it to be the first to go when the RAM runs out. What is actually done beyond that point to understand why the process was killed is not something I'm immediately aware of.
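A rough sketch of the cgroup idea, assuming cgroup v2 with the memory controller enabled for the parent group (via the parent's `cgroup.subtree_control`) and an agent that is allowed to create child groups. It caps the job group with `memory.max` so the job hits its own limit before the machine runs out, and sets `memory.oom.group` so the whole job is killed as a unit. The function names, the group name, and the choice of limit are all hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Create a child cgroup under `parent` and configure its memory policy.
/// On a real cgroup v2 mount the control files appear when the directory is
/// created; here they are simply written by name.
fn create_job_cgroup(parent: &Path, name: &str, memory_max_bytes: u64) -> io::Result<PathBuf> {
    let cg = parent.join(name);
    fs::create_dir_all(&cg)?;
    // Hard memory limit for everything in the group.
    fs::write(cg.join("memory.max"), memory_max_bytes.to_string())?;
    // On OOM inside the group, kill all of its processes together.
    fs::write(cg.join("memory.oom.group"), "1")?;
    Ok(cg)
}

/// Move a process into the cgroup by writing its PID to `cgroup.procs`.
fn add_process(cg: &Path, pid: u32) -> io::Result<()> {
    fs::write(cg.join("cgroup.procs"), pid.to_string())
}
```

With the job confined this way, the agent stays outside the group, so an out-of-memory condition inside the job's limit is charged to the job and not to the agent.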

There's also systemd-oomd, I'm not sure how recent it is (relevant for the Ubuntu images older than 24.04), but it apparently exists because the Linux kernel's oomkiller logic leaves a lot to be desired and can't really dynamically take things into account beyond the one knob of oom_score_adj; one knob does not a policy make.

iliana changed the title ~~worker agent on linux might be getting oomkilled~~ worker agent on linux might be getting oomkilled when it's the job's fault instead Oct 14, 2024
