Instance lifetime timeout #836
base: main
Conversation
This adds a timeout based on instance lifetime, intended to be a supplement to idle-timeout in the agent itself. It prevents busy agents from living forever if they never get enough idle time to turn themselves off.
This is a creative solution. However, periodically restarting the agent seems like a band-aid fix for the agent failing to clean up after itself. I wonder how the agents currently manage disk size.
Thanks @dbaggerman! I've shared this internally to collect input from folks.
I assume most elastic stack users rely on the instances self-terminating after the idle timeout to avoid similar issues - does that work less effectively for you because you use AgentsPerInstance > 1 and need all agents on the instance to reach the idle timeout before an instance can self-terminate?
Builds will early-fail if a docker image prune is unable to free up enough space: elastic-ci-stack-for-aws/packer/linux/conf/buildkite-agent/hooks/environment (lines 25 to 35 in bffd450)
... and an hourly cron job will cause the instance to fail a healthcheck if the same thing happens: elastic-ci-stack-for-aws/packer/linux/conf/docker/cron.hourly/docker-low-disk-gc (lines 29 to 38 in bffd450)
No doubt there are many other ways to consume disk space that we currently have no tooling in place to mitigate, though.
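To illustrate the general shape of that kind of guard (this is a hedged sketch, not the actual hook from the repository; the threshold variable and the prune strategy are assumptions), a pre-job check might look something like this:

```bash
#!/usr/bin/env bash
# Hypothetical pre-job disk-space guard, sketched for illustration only.
# DISK_MIN_AVAILABLE is an assumed threshold in kilobytes, not a real
# stack parameter.
set -euo pipefail

DISK_MIN_AVAILABLE="${DISK_MIN_AVAILABLE:-5242880}"  # ~5GB

available_kb() {
  df -k --output=avail / | tail -n 1 | tr -d ' '
}

if [[ "$(available_kb)" -lt "$DISK_MIN_AVAILABLE" ]]; then
  echo "Disk space is low, pruning unused docker images..."
  docker image prune --all --force

  if [[ "$(available_kb)" -lt "$DISK_MIN_AVAILABLE" ]]; then
    echo "Still below the threshold after pruning; failing the job early." >&2
    exit 1
  fi
fi
```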
We mostly rely on the idle timeout as well, but we have one queue where we used to regularly have instances alive for days despite having the idle timeout configured. I suspect that having multiple agents on the instance, and needing them all idle at the same time to reach the timeout, contributed to that, although the queue is shared by builds/teams in different timezones so it can have jobs queued around the clock. Also, while disk space is the most common problem, we have run into other problems on long-running instances as well. For example, at one point we were accumulating running containers. It turned out that sidecar containers were being started by
Being hourly, that can result in a lot of jobs failing for up to an hour before the next time the cron runs. Bringing that back to run more regularly would make it less painful if it does occur, but would still be reacting to a problem rather than proactively preventing it.
We had this problem on Jenkins, and the cloud node plugin has a feature to terminate the instance after x builds have run. We set it to 100 builds and we have a daily instance rotation. An even simpler method could be configuring MaxInstanceLifetime on the ASG.
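As a rough sketch of that simpler approach (the group name and lifetime below are placeholders, not values from this stack), the limit can be applied to an existing Auto Scaling group with the AWS CLI:

```bash
# Hypothetical example: cap instance lifetime at 24 hours (the minimum
# MaxInstanceLifetime allows) so older instances are gradually replaced.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-buildkite-agents-asg \
  --max-instance-lifetime 86400
```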
The buildkite agent has a
This may have the side-effect of interrupting any jobs in progress on the agent. You don't get much of a grace period for running jobs to finish before the instance terminates.
Isn't there a lifecycle hook in the stack that allows jobs to finish before continuing the termination? If so, I believe the instance refresh respects this... but perhaps there isn't a hook. I don't see one defined in the CloudFormation stack.
The elastic stack image does include https://github.com/buildkite/lifecycled to manage the lifecycle events, which is probably what you're thinking of. Even with that though, the autoscaling group will only let you delay the shutdown for an hour or so. We have jobs that run much longer than that, which would still get interrupted - although using it as an alternative to the hard timer in the PR would work.
This is something we've been running in our version of the elastic stack, which I thought might be of interest upstream. This is pretty much cut and paste from what we do internally, and while this behaviour (with these timeouts) may or may not be ideal as a default for everyone, I thought it would be worthwhile to start a discussion.
We have one agent queue which acts as a shared pool for a variety of tasks. These instances are also configured to run several agents per instance. The combination of these two things means that even with a reasonably short idle-timeout, it was often the case that instances could go days (if not weeks) without hitting their idle timeout, up to the point where the disk would fill up and jobs would start failing. At which point we'd have to go and kill them manually.

After several attempts at managing disk space on long-running instances, this is the solution we came up with. We haven't had any of that kind of problem since implementing this, although it is what ultimately led to buildkite/buildkite-agent-scaler/issues/39.
What this does is start a pair of systemd timers when the instance starts. After three hours, the first timer will be reached and trigger a job which sends a TERM signal to the agent, telling it to stop accepting new jobs. Once the running jobs complete, the agent will shut down, triggering the standard shutdown behaviour.

If any builds are in a stuck/hung state (or otherwise still running after 21+ hours), then the second timer will be reached once the instance has been alive for 24 hours (21 hours after the soft stop). This timer tells systemd to stop the buildkite-agent service (forcefully if necessary), which again results in the instance shutting down so it can be replaced.
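As a hedged sketch of that mechanism (not the exact units from this PR; the unit names, timings, and the use of transient timers are assumptions), something with the same shape could be set up at boot with systemd-run:

```bash
# Hypothetical sketch of the two-stage timeout described above.
# Unit names and timings are illustrative only.

# Soft stop: after 3 hours, signal the agent's main process with SIGTERM so
# it stops accepting new jobs and exits once its current jobs finish.
systemd-run --on-active=3h --unit=buildkite-agent-soft-stop \
  systemctl kill --kill-who=main --signal=SIGTERM buildkite-agent

# Hard stop: after 24 hours, stop the service outright (forcefully if
# necessary), which triggers the normal instance shutdown and replacement.
systemd-run --on-active=24h --unit=buildkite-agent-hard-stop \
  systemctl stop buildkite-agent
```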