
Cost per job #93

Open · cmelone wants to merge 3 commits into develop from add/collect-cost
Conversation

@cmelone cmelone commented Aug 20, 2024

Closes #75

Computes and stores the following metrics:

  • cpu_cost: cost of using CPU resources on the node, based on the CPU request of the job
  • mem_cost: the same, for memory
  • cpu_penalty: penalty factor that represents the over- or under-allocation of CPU resources
  • mem_penalty: the same, for memory

To normalize the cost of resources within instance types, we'll define cost per resource metrics.

$$\text{Cost per CPU}_i = \frac{C_i \times 0.5}{\text{CPU}_i}$$ $$\text{Cost per RAM}_i = \frac{C_i \times 0.5}{\text{RAM}_i}$$
  • $C_i$ is the cost of node $i$ over the life of the job
  • $\text{CPU}_i$ is the number of CPUs available on node $i$
  • $\text{RAM}_i$ is the amount of RAM available on node $i$
  • $C_i$ is halved in each formula because we assume CPU and RAM each account for half of the node's cost.
$$\text{Job Cost} = (\text{CPU}_{\text{usage}} \times \text{Cost per CPU}_i + \text{RAM}_{\text{usage}} \times \text{Cost per RAM}_i)$$
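As a rough Python sketch of the formulas above (function and variable names are illustrative, not taken from this PR's code):

```python
def cost_per_resource(node_cost: float, node_cpus: float, node_ram_gb: float):
    """Split the node's cost evenly between CPU and RAM (the 0.5 factor),
    then normalize by the node's capacity."""
    return (node_cost * 0.5) / node_cpus, (node_cost * 0.5) / node_ram_gb


def base_job_cost(cpu_usage: float, ram_usage: float,
                  cost_per_cpu: float, cost_per_ram: float) -> float:
    """Base cost per job: mean usage of each resource times its unit cost."""
    return cpu_usage * cost_per_cpu + ram_usage * cost_per_ram
```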

Using this base cost per job metric, jobs are rewarded for minimizing usage and wall time. However, it does not penalize them for disruptions to the cluster caused by misallocation.

Underallocation can potentially slow down other jobs on the same node, and overallocation delays the scheduling of other jobs. A penalty factor would be useful for quantifying these negative impacts on the CI system and encouraging better resource requests.

$$\text{P}_{\text{CPU}} = |\text{CPU}_{\text{usage}} - \text{CPU}_{\text{request}}|$$ $$\text{P}_{\text{RAM}} = |\text{RAM}_{\text{usage}} - \text{RAM}_{\text{request}}|$$

With the penalty, cost per job would be:

$$((\text{CPU}_{\text{usage}} + \text{P}_{\text{CPU}}) \times \text{Cost per CPU}_i + (\text{RAM}_{\text{usage}} + \text{P}_{\text{RAM}}) \times \text{Cost per RAM}_i )$$
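A hypothetical extension of the sketch above that applies the penalty (again, names are illustrative):

```python
def penalized_job_cost(cpu_usage: float, cpu_request: float,
                       ram_usage: float, ram_request: float,
                       cost_per_cpu: float, cost_per_ram: float) -> float:
    """Charge each resource for its usage plus the |usage - request|
    misallocation penalty."""
    p_cpu = abs(cpu_usage - cpu_request)
    p_ram = abs(ram_usage - ram_request)
    return (cpu_usage + p_cpu) * cost_per_cpu + (ram_usage + p_ram) * cost_per_ram
```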

Job cost and $P$ are stored separately: the former represents the "true" cost, while the latter measures the efficiency of a job's resource requests via an artificial penalty. When analyzing costs, node instance type should be controlled for, because cost per job is influenced by $\text{Cost per CPU}_i$ and $\text{Cost per RAM}_i$, which vary among instance types.


For example:

  • instance costs 100 cents per hour
  • instance has 100GB memory and 100 cores
  • job duration was 30 minutes
  • resource requests: 2GB memory, 2 cores
  • mean usage: 1GB memory, 5 cores

$C_i$ = 50 cents (cost of the instance while the job ran)

$$\text{Cost per CPU}_i = \frac{50 \times 0.5}{100} = 0.25 \text{ cents}$$ $$\text{Cost per RAM}_i = \frac{50 \times 0.5}{100} = 0.25 \text{ cents}$$

therefore,

$$\text{Cost for RAM}_i = 0.25 \times 1 = 0.25 \text{ cents}$$ $$\text{Cost for CPUs}_i = 0.25 \times 5 = 1.25 \text{ cents}$$

computing the penalties:

$$\text{P}_{\text{CPU}} = |5 - 2| = 3$$ $$\text{P}_{\text{RAM}} = |1 - 2| = 1$$

In this case, we penalize the job for using more CPU than it requested, which could have crowded out other jobs. We also penalize the job for using less RAM than requested because when k8s scheduled the job, it blocked those resources from being scheduled for other work.

"total" cost:

$$((5 + 3) \times 0.25 + (1 + 1) \times 0.25) = 2.5$$
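The same arithmetic as a quick sanity check in Python:

```python
node_cost = 100 * 0.5                 # 100 cents/hour for 30 minutes = 50 cents
cost_per_cpu = node_cost * 0.5 / 100  # 0.25 cents per core
cost_per_ram = node_cost * 0.5 / 100  # 0.25 cents per GB
p_cpu = abs(5 - 2)                    # over-used CPU: penalty of 3
p_ram = abs(1 - 2)                    # under-used RAM: penalty of 1
total = (5 + p_cpu) * cost_per_cpu + (1 + p_ram) * cost_per_ram
print(total)                          # 2.5 cents
```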

@cmelone cmelone self-assigned this Aug 20, 2024
@cmelone cmelone requested a review from alecbcs August 20, 2024 20:30
@cmelone cmelone added the feature New feature or request label Aug 20, 2024
@cmelone cmelone marked this pull request as draft August 27, 2024 15:20
@cmelone cmelone changed the title Collect node spot instance costs Cost per job Oct 10, 2024
@cmelone cmelone force-pushed the add/collect-cost branch 2 times, most recently from 4016e90 to b3c5b7b on October 10, 2024 20:19
@cmelone cmelone marked this pull request as ready for review October 10, 2024 20:37
@cmelone cmelone requested a review from tgamblin October 10, 2024 20:39
cmelone commented Oct 10, 2024

requesting @tgamblin review of cost formula not code

@cmelone cmelone mentioned this pull request Oct 23, 2024
> Computes and stores the following metrics: `cpu_cost`, `mem_cost`, `cpu_penalty`, `mem_penalty`

If we'd like to modify this cost analysis in the future, it'd be useful to have these metrics as the basis of the calculation.
cmelone added a commit that referenced this pull request Nov 12, 2024
Pulled from #93 to collect data before deciding on final cost formula.

Adds the following columns:
- nodes: AWS zone (`zone`)
- nodes: Instance capacity type (`capacity_type`)
- jobs: Cost of the instance during the lifetime of the job (`job_cost_instance`)

The `job_cost_instance` calculation is made by averaging the value of `karpenter_cloudprovider_instance_type_offering_price_estimate` during the lifetime of the node and multiplying by the duration of the build job.

**This is not a cost per job metric.** Use information like cpu_mean, mem_mean, etc to calculate the cost of the job in combination with `job_cost_instance`.

Tested with `dev/bulk_collect.py` and verified that large migrations work correctly on the prod and staging databases.
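A minimal sketch of the `job_cost_instance` calculation described in this commit message, assuming the Karpenter price-estimate samples have already been fetched from the metrics backend (how they are queried is not shown in this excerpt):

```python
from statistics import mean


def job_cost_instance(price_samples: list[float], job_duration_hours: float) -> float:
    """Average the karpenter_cloudprovider_instance_type_offering_price_estimate
    samples taken over the node's lifetime, then multiply by the job's wall time
    to get the cost of the instance while the job ran."""
    return mean(price_samples) * job_duration_hours
```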