
Add resource limits #106

Open · cmelone wants to merge 17 commits into develop
Conversation

@cmelone (Collaborator) commented Sep 27, 2024

This is the first version of our prediction formulas for max CPU and memory.

This PR also sets SPACK_BUILD_JOBS equal to the CPU request, rounded to the nearest core.
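As a minimal sketch of the SPACK_BUILD_JOBS piece (the function name is hypothetical; the real logic lives in this PR's diff), the idea is to round a possibly fractional CPU request to the nearest whole core:

```python
# Hypothetical sketch: derive SPACK_BUILD_JOBS from the CPU request.
# Names are illustrative; the actual implementation is in this PR's diff.

def build_jobs_from_cpu_request(cpu_request: float) -> int:
    """Round a (possibly fractional) CPU request to the nearest core, minimum 1."""
    return max(1, round(cpu_request))

# e.g. a request of 3.6 cores -> SPACK_BUILD_JOBS=4
env = {"SPACK_BUILD_JOBS": str(build_jobs_from_cpu_request(3.6))}
```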

Using the included simulation script, I ran a scenario where we allocated resources for 8000 specs.


The max memory prediction includes a 20% "bump," which avoids the OOM killing of ~1100 jobs in the simulation.
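To make the bump concrete, here is a hedged sketch. It assumes the baseline prediction comes from the historical per-sample peaks (an assumption on my part; the exact formula is in the diff) and inflates it by 20% as headroom:

```python
# Hedged sketch of the 20% memory bump. The baseline statistic
# (max of historical per-sample peaks) is an assumption; the exact
# formula is defined in this PR's diff.

MEM_BUMP = 1.2  # 20% headroom against OOM kills

def predict_mem_limit(sample_mem_maxes: list[float]) -> float:
    """Predict a memory limit (MB) from historical per-sample peaks."""
    return max(sample_mem_maxes) * MEM_BUMP
```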

The ratio of actual to predicted memory usage was 0.6963, meaning we are overallocating by roughly 30%.
Even so, 437 of the 8000 jobs were OOM-killed, an OOM rate of 0.055, far higher than we would like.

@alecbcs and I discussed alternative prediction strategies that factor in the ratio of memory to cores; one possible reading is sketched below.
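Purely as an illustration of that idea (this is not the PR's formula, just one way the coupling could work), memory could be predicted per core and then scaled by the cores actually allocated:

```python
# Illustrative sketch of a mem/cores-coupled prediction; not the PR's
# formula, just one reading of the strategy discussed above.

def predict_mem_from_cores(sample_mem_maxes: list[float],
                           sample_cpu_maxes: list[float],
                           allocated_cores: float) -> float:
    """Scale a historical peak memory-per-core ratio by the allocated cores."""
    mem_per_core = max(m / c for m, c in zip(sample_mem_maxes, sample_cpu_maxes))
    return mem_per_core * allocated_cores
```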

For example, consider a job whose peak memory usage was roughly 3x its prediction, along with the data used to make that prediction:

```
[email protected] ~guile build_system=generic%[email protected] gitlab_id=12859608
duration: 1262 cpu_mean: 0.621, cpu_max: 0.956, mem_mean: 2590.126, mem_max: 4448.702

samples:
duration: 181 cpu_mean: 0.169, cpu_max: 0.424, mem_mean: 105.722, mem_max: 168.37
duration: 149 cpu_mean: 0.531, cpu_max: 1.064, mem_mean: 702.054, mem_max: 1033.888
duration: 107 cpu_mean: 0.283, cpu_max: 0.415, mem_mean: 95.556, mem_max: 149.381
duration: 432 cpu_mean: 0.31, cpu_max: 1.051, mem_mean: 100.313, mem_max: 1300.226
duration: 396 cpu_mean: 0.268, cpu_max: 1.023, mem_mean: 172.576, mem_max: 1364.505
```

This package usually takes 4-5 minutes to build, but this run took 21 minutes and peaked at nearly 4x its usual memory usage.
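For concreteness, plugging the samples above into the assumed bump formula from earlier (max of sample peaks times 1.2; again an assumption, not necessarily the exact formula in the diff) reproduces the roughly 3x gap:

```python
# Reproducing the ~3x underprediction with the assumed formula above.
sample_mem_maxes = [168.37, 1033.888, 149.381, 1300.226, 1364.505]

predicted = max(sample_mem_maxes) * 1.2   # ~1637.4 MB
actual_peak = 4448.702                    # MB, from the run above

print(actual_peak / predicted)            # ~2.7, i.e. roughly a 3x underprediction
```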

In my opinion, there is no data available to us that would allow an accurate prediction in this scenario, and the same holds for most of the outliers I've seen. Here, the job in question may have been slowed down by a noisy neighbor not respecting its allocation.

My vote is to keep the formula as-is and tweak it once we deploy gantry to the staging cluster with limits in place.


The ratio of actual to predicted usage for max CPU was 0.9546.

@cmelone cmelone self-assigned this Sep 27, 2024
@cmelone cmelone marked this pull request as ready for review October 8, 2024 18:00
@cmelone cmelone changed the title draft: add resource limits Add resource limits Oct 8, 2024
@cmelone (Collaborator, Author) commented Oct 23, 2024

will rebase this as well as #93

@cmelone (Collaborator, Author) commented Oct 25, 2024

past thread on deciding # of build jobs: spack/spack#26242

@HadrienG2 I figured you might be interested to know we're working on this for our CI; the approach is quite similar to your comment.

@github-actions github-actions bot added the ci Involving Project CI & Unit Tests label Oct 29, 2024