
Retry OOM killed jobs #4

Draft · wants to merge 12 commits into develop
Conversation

@cmelone (Collaborator) commented on Jan 23, 2024

Automatically retries jobs if they are OOM killed after gantry underallocates memory.


  • Implements a pipeline webhook handler that checks whether any jobs failed because they were OOM killed and inserts them into the db.
  • Creates a new pipeline, following spackbot's example and GitLab's API

As discussed in #74, we would prefer to restart jobs directly by supplying new variables, but GitLab does not support this. When a new pipeline is created for a ref, successful builds from the previous pipeline are pruned in the generate/concretization step, minimizing wasted cycles.
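A minimal sketch of that webhook flow, assuming an aiohttp-style handler and the standard GitLab pipeline-webhook payload. `is_oom_killed`, `insert_oom_job`, the db reference, and the instance URL are hypothetical placeholders, not gantry's actual code:

```python
# Hedged sketch, not gantry's actual handler: is_oom_killed, insert_oom_job,
# the db reference, and GITLAB_API are placeholders for illustration.
import os

import aiohttp
from aiohttp import web

GITLAB_API = "https://gitlab.example.com/api/v4"  # assumption: instance URL


async def is_oom_killed(job_id: int) -> bool:
    # placeholder: real detection will live in prometheus.job.is_oom
    return False


async def insert_oom_job(db, job: dict) -> None:
    # placeholder: record the job in the jobs table with oomed=True
    ...


async def pipeline_webhook(request: web.Request) -> web.Response:
    payload = await request.json()
    if payload.get("object_kind") != "pipeline":
        return web.Response(status=400)

    pipeline = payload["object_attributes"]
    if pipeline["status"] != "failed":
        return web.Response(status=200)

    # collect failed jobs that were OOM killed
    oomed = [
        job for job in payload.get("builds", [])
        if job["status"] == "failed" and await is_oom_killed(job["id"])
    ]
    for job in oomed:
        await insert_oom_job(request.app.get("db"), job)

    if oomed:
        # GitLab cannot retry a job with new variables, so create a fresh
        # pipeline for the same ref; successful builds from the previous
        # pipeline are pruned during generate/concretization.
        async with aiohttp.ClientSession() as session:
            await session.post(
                f"{GITLAB_API}/projects/{payload['project']['id']}/pipeline",
                params={"ref": pipeline["ref"]},
                headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
            )

    return web.Response(status=200)
```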

During the generate step, gantry will receive a request for resource allocations for a job that was recently OOM killed. The program will look for an exact spec match in the database and return these modified variables:

  • KUBERNETES_MEMORY_LIMIT * 1.2 -- bump the past limit by 20%
  • GANTRY_RETRY_COUNT += 1 -- maintain a count of how many times this spec has been retried
  • the CPU request/limit and memory request will remain unmodified
  • maybe: GANTRY_RETRY_ID -- GitLab ID of the original job, to link retries together...not sure if necessary

To ensure we don't fall into an infinite loop of increasing memory limits, gantry will not bump the limit if the retry count would exceed three. Additionally, it will not restart a pipeline if all of its OOM killed jobs have already exceeded the retry limit. This means we are allowing certain jobs to fail. If we investigate and decide an increase is warranted, how do we ensure that this gets communicated to gantry?
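Not gantry's implementation, just a hedged sketch of the bump-and-cap rule above; `retry_variables`, `last_alloc`, and its keys are hypothetical, and only the 20% factor and the retry limit of three come from the description:

```python
# Hedged sketch of the adjustment rule; the function and dict keys are
# hypothetical. Only the 20% bump and the retry limit of three come from
# the description above.
RETRY_LIMIT = 3
MEM_BUMP = 1.2  # raise the previous memory limit by 20%


def retry_variables(last_alloc: dict) -> dict | None:
    """Return bumped variables for an OOM-killed spec, or None if the retry
    budget is exhausted (the job is allowed to fail)."""
    retries = last_alloc.get("retry_count", 0)
    if retries >= RETRY_LIMIT:
        return None

    return {
        # only the memory limit grows; the CPU request/limit and memory
        # request are passed through from the previous allocation
        "KUBERNETES_MEMORY_LIMIT": str(int(last_alloc["mem_limit"] * MEM_BUMP)),
        "KUBERNETES_MEMORY_REQUEST": str(last_alloc["mem_request"]),
        "KUBERNETES_CPU_REQUEST": str(last_alloc["cpu_request"]),
        "GANTRY_RETRY_COUNT": str(retries + 1),
    }
```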

New optional columns in the jobs table:

  • oomed -- whether the failed job was OOM killed
  • retry_count -- number of times the job has been retried
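For reference, a sketch of that schema change under the assumption that the jobs table lives in SQLite; the actual migration may look different:

```python
# Hedged sketch: assumes gantry's jobs table is in SQLite; the real
# migration may differ.
import sqlite3


def add_retry_columns(db_path: str) -> None:
    with sqlite3.connect(db_path) as conn:
        # whether the failed job was OOM killed
        conn.execute("ALTER TABLE jobs ADD COLUMN oomed INTEGER")
        # number of times the job has been retried
        conn.execute("ALTER TABLE jobs ADD COLUMN retry_count INTEGER")
```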

TODO:

  • fix OOM detection in the k8s cluster; see the kitware-llnl channel
  • tests
  • figure out how to deal with the missing-data issue -- use the last timestamp as a datapoint? (what happens during missed webhooks?)
  • get api permissions for restarting pipelines in spack-infra terraform config
  • add new variables to annotations in spack-infra k8s config

Questions:

  • Do we need to weight the most recent build more heavily after it has been retried and bumped up?
    • No, we will allow the genetic algorithm to learn. If subsequent jobs are OOM killed, they will be given more memory and retried, which will eventually lead to an optimal memory limit.

@cmelone added the feature (New feature or request) label on Jan 23, 2024
@cmelone self-assigned this on Jan 23, 2024
@cmelone mentioned this pull request on Jan 24, 2024
Base automatically changed from add/collection-func to develop on February 12, 2024 19:19
@cmelone mentioned this pull request on Sep 17, 2024
@cmelone changed the title from "Handle build OOMs" to "Retry OOM killed jobs" on Sep 17, 2024
@cmelone marked this pull request as ready for review on September 18, 2024 22:03
@cmelone (Collaborator, Author) commented on Sep 18, 2024

@alecbcs the pipeline webhook aspect of this PR is ready for review. This first step essentially detects OOM killed jobs and inserts them into the db.

As I mentioned in the kitware channel, OOM detection is broken at the moment, so I will update the prometheus.job.is_oom method once we have that cleared up.
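For context, one possible way to check for an OOM kill is through kube-state-metrics' last-terminated-reason metric in Prometheus; the metric's availability in our cluster, the pod label, and the URL are assumptions, not necessarily what prometheus.job.is_oom will do once detection is fixed:

```python
# Hedged sketch: one possible OOM check via Prometheus / kube-state-metrics.
# The metric's availability, the pod label, and the URL are assumptions.
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.example.com"  # assumption


def is_oom(pod_name: str) -> bool:
    # kube-state-metrics reports the last termination reason per container
    query = (
        "kube_pod_container_status_last_terminated_reason"
        f'{{reason="OOMKilled", pod="{pod_name}"}}'
    )
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url) as resp:
        result = json.load(resp)["data"]["result"]
    return len(result) > 0
```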

@github-actions (bot) added the docs (Improvements or additions to documentation) label on Sep 19, 2024
@cmelone marked this pull request as draft on October 9, 2024 18:32
- if pipeline failed, check if any of the jobs failed due to OOM
- insert OOM jobs into database for prediction step
If a job is over the defined retry limit, we won't mark it as needing to be retried. However, because we handle this at the pipeline level, if another job in the pipeline was OOMed but is not over the retry limit, the pipeline will still be retried, leading to some idiosyncrasies.
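A hedged sketch of that pipeline-level decision; the helper name and job dict shape are illustrative only:

```python
# Hedged sketch of the pipeline-level decision; names and dict shape are
# illustrative, not gantry's code.
RETRY_LIMIT = 3


def should_retry_pipeline(oom_jobs: list[dict]) -> bool:
    """Retry the pipeline if at least one OOM-killed job is still under the
    retry limit; jobs already over the limit are left to fail, but they will
    ride along if another job triggers the retry."""
    return any(job.get("retry_count", 0) < RETRY_LIMIT for job in oom_jobs)
```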