Commit 8020ae0

Fill up KEP
Signed-off-by: Itamar Holder <[email protected]>
1 parent 0b43e10 commit 8020ae0

File tree

3 files changed: +158 -27 lines changed

+3
@@ -0,0 +1,3 @@
+kep-number: 5246
+alpha:
+  approver: "deads2k"

keps/sig-node/5246-cgroup-cpu-share-to-weight-conversion/README.md

+141 -9
@@ -3,28 +3,28 @@

To get started with this template:

-- [ ] **Pick a hosting SIG.**
+- [X] **Pick a hosting SIG.**
  Make sure that the problem space is something the SIG is interested in taking
  up. KEPs should not be checked in without a sponsoring SIG.
-- [ ] **Create an issue in kubernetes/enhancements**
+- [X] **Create an issue in kubernetes/enhancements**
  When filing an enhancement tracking issue, please make sure to complete all
  fields in that template. One of the fields asks for a link to the KEP. You
  can leave that blank until this KEP is filed, and then go back to the
  enhancement and add the link.
-- [ ] **Make a copy of this template directory.**
+- [X] **Make a copy of this template directory.**
  Copy this template into the owning SIG's directory and name it
  `NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no
  leading-zero padding) assigned to your enhancement above.
-- [ ] **Fill out as much of the kep.yaml file as you can.**
+- [X] **Fill out as much of the kep.yaml file as you can.**
  At minimum, you should fill in the "Title", "Authors", "Owning-sig",
  "Status", and date-related fields.
-- [ ] **Fill out this file as best you can.**
+- [X] **Fill out this file as best you can.**
  At minimum, you should fill in the "Summary" and "Motivation" sections.
  These should be easy if you've preflighted the idea of the KEP with the
  appropriate SIG(s).
-- [ ] **Create a PR for this KEP.**
+- [X] **Create a PR for this KEP.**
  Assign it to people in the SIG who are sponsoring this process.
-- [ ] **Merge early and iterate.**
+- [X] **Merge early and iterate.**
  Avoid getting hung up on specific details and instead aim to get the goals of
  the KEP clarified and merged quickly. The best way to do this is to just
  start with the high-level sections and fill out details incrementally in
@@ -58,7 +58,7 @@ If none of those approvers are still appropriate, then changes to that list
should be approved by the remaining approvers and/or the owning SIG (or
SIG Architecture for cross-cutting KEPs).
-->
-# KEP-NNNN: Your short, descriptive title
+# KEP-5246: Migrate to systemd's cgroup v1 CPU shares to v2 CPU weight formula

<!--
This is the title of your KEP. Keep it short, simple, and descriptive. A good
@@ -79,7 +79,10 @@ tags, and then generate with `hack/update-toc.sh`.
<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
+  - [Examples for the current state](#examples-for-the-current-state)
- [Motivation](#motivation)
+  - [A non-Kubernetes workload has a much higher priority in v2](#a-non-kubernetes-workload-has-a-much-higher-priority-in-v2)
+  - [A too-small granularity](#a-too-small-granularity)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
@@ -128,7 +131,7 @@ checklist items _must_ be updated for the enhancement to be released.

Items marked with (R) are required *prior to targeting to a milestone / release*.

-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
@@ -173,6 +176,70 @@ updates.
[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->

+Kubernetes was originally implemented with cgroup v1 in mind.
+In cgroup v1, a container's CPU shares were derived very simply from the container's CPU request in millicpu form, scaled so that 1 CPU maps to `1024` shares.
+
+As an example, for a container requesting 1 CPU (i.e. `1000m` CPU): `cpu.shares = 1024`.
+
+Later, when there was a need to support and shift the focus to cgroup v2, a [dedicated KEP-2254](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2) was submitted.
+
+Cgroup v1 and v2 have very different value ranges for CPU shares and weight.
+
+Cgroup v1 uses a range of `[2^1 - 2^18] == [2 - 262144]` for CPU shares.
+
+Cgroup v2 uses a range of `[10^0 - 10^4] == [1 - 10000]` for CPU weight.
+
+As part of that KEP, it was agreed to use the following formula to convert cgroup v1's `cpu.shares` to cgroup v2's CPU weight, as can be seen [here](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2#phase-1-convert-from-cgroups-v1-settings-to-v2):
+
+`cpu.weight = (1 + ((cpu.shares - 2) * 9999) / 262142) // convert from [2-262144] to [1-10000]`
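To make the mapping easy to check, the two conversions above (CPU request to v1 `cpu.shares`, and `cpu.shares` to v2 `cpu.weight`) can be sketched in Python. This is an illustrative sketch, not Kubernetes code: the function names are made up, and integer (floor) division is assumed to match the truncated values observed in the filesystem.

```python
def milli_cpu_to_shares(milli_cpu: int) -> int:
    # cgroup v1: 1 CPU (1000m) maps to the default of 1024 shares.
    return max(2, milli_cpu * 1024 // 1000)

def shares_to_weight(shares: int) -> int:
    # KEP-2254 formula: rescale the v1 range [2, 262144] onto [1, 10000].
    return 1 + (shares - 2) * 9999 // 262142

# A container requesting 1 CPU ends up with cpu.weight 39:
print(shares_to_weight(milli_cpu_to_shares(1000)))  # 39
# A container requesting 100m CPU ends up with cpu.weight 4:
print(shares_to_weight(milli_cpu_to_shares(100)))   # 4
```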
+
+### Examples for the current state
+
+Let's start with an example to understand what the cgroup configuration looks like in both environments.
+
+I'll use the following dummy pod and run it on v1 and v2 setups:
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: dummy-sleeping-pod
+spec:
+  containers:
+  - name: sleep-container
+    image: busybox
+    command: ["sleep", "infinity"]
+    resources:
+      requests:
+        cpu: 1
+```
+
+On cgroup v1 the underlying configuration is pretty intuitive:
+```shell
+> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu/cpu.shares"
+1024
+```
+
+On v2, the configuration looks like the following:
+```shell
+> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu.weight"
+39
+```
+
+And indeed, according to the formula above, `cpu.weight = (1 + ((1024 - 2) * 9999) / 262142) ~= 39.9`.
+
+If I change the pod to request only `100m` CPU, the configuration will look like the following:
+on v1:
+```shell
+> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu/cpu.shares"
+102
+```
+
+on v2:
+```shell
+> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu.weight"
+4
+```
+

## Motivation

<!--
@@ -184,19 +251,47 @@ demonstrate the interest in a KEP within the wider Kubernetes community.
[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->

+The above formula focuses on converting the values from one range to another, keeping each value at the same percentile of the range. As an example, if a certain CPU shares value sits at 20% of the v1 range, it will still sit at 20% of the range when converted to cgroup v2.
+
+However, this imposes several problems.
+
+### A non-Kubernetes workload has a much higher priority in v2
+The default CPU shares value in cgroup v1 is `1024`.
+This means that when Kubernetes workloads compete with non-Kubernetes workloads (system daemons, drivers, the kubelet itself, etc.), a container requesting 1 CPU has the same CPU priority as a "regular" process. Asking for less than 1 CPU grants a lower priority, and vice versa.
+
+However, in cgroup v2, the default CPU weight is `100`.
+This means that (as can be seen above) a container asking for 1 CPU now has less than 40% of the default CPU weight.
+
+The implication is that Kubernetes workloads have much lower CPU priority relative to non-Kubernetes workloads under v2.
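To quantify this, here is a small illustrative calculation (plain Python, not Kubernetes code). It assumes the simplest case: the container's cgroup and a sibling cgroup left at the default value compete under the same parent, and both are fully CPU-bound, so CPU time is split proportionally to shares (v1) or weight (v2).

```python
def cpu_fraction(own: float, sibling: float) -> float:
    # Siblings receive CPU time proportionally to their shares (v1) or weight (v2).
    return own / (own + sibling)

# v1: a container requesting 1 CPU (1024 shares) vs. a default-1024 sibling:
print(cpu_fraction(1024, 1024))          # 0.5 -- an even split
# v2: the same container (weight 39) vs. a default-100 sibling:
print(round(cpu_fraction(39, 100), 2))   # 0.28 -- well under half
```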
+
+### A too-small granularity
+As can be seen above, a container that requests `100m` CPU has a CPU weight of only `4`, while on v1 it would have `102` CPU shares.
+
+This value is not granular enough.
+This is relevant for use cases in which sub-cgroups need to be configured inside a container to further distribute resources within it.
+
+As an example, there could be a container running a few CPU-intensive processes and one managerial process that does not need to consume a lot of CPU, but needs to be very responsive. In such a case, sub-cgroups can be created inside the container, leaving 90% of the weight to the CPU-bound processes and 10% to the other process.
+
187276
### Goals

<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->
+- Just like in cgroup v1, when a container asks for 1 CPU it should get the default amount of CPU weight, which is `100`.
+  In the same way, asking for 500m CPU should result in a CPU weight of `50`, and so on.
+  This aligns the v1 and v2 behaviors.
+- Track that the different layers (OCI, CRI, kubelet, etc.) are aligned with the new formula.

### Non-Goals

<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->
+- Introduce new APIs to configure cgroups.
+- Change CPU priorities between Kubernetes workloads.

## Proposal

@@ -209,6 +304,25 @@ The "Design Details" section below is for the real
nitty-gritty.
-->

+As [suggested](https://github.com/kubernetes/kubernetes/issues/131216#issuecomment-2806442083)
+by the [original KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2) author,
+we should migrate to systemd's formula for converting CPU shares to CPU weight.
+
+The current formula that we're using is:
+`cpu.weight = (1 + ((cpu.shares - 2) * 9999) / 262142)`
+
+While systemd's formula is:
+`cpu.weight = 1 + ((cpu.shares - 2) * 99) / (1024 - 2)`
+
+The main difference between the formulas is that the first one focuses on converting values from one range to another,
+keeping each value at the same percentile of the range.
+The second one focuses on mapping a container requesting 1 CPU (`1024` shares) to the default CPU weight of `100`.
+Accordingly, requesting 500m CPU (`512` shares) results in a weight of `50`, which aligns with v1, where `512` shares are exactly half of the default `1024`.
+
+Let cpu.shares be 1024. Therefore: `1 + ((cpu.shares - 2) * 99) / (1024 - 2) == 100`.
+
+Let cpu.shares be 512. Therefore: `1 + ((cpu.shares - 2) * 99) / (1024 - 2) == 50.4031 ~= 50`.
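The two formulas can also be compared side by side with a short illustrative Python sketch (this is not the actual kubelet/runc code; function names are made up, and floor division is assumed):

```python
def weight_kep2254(shares: int) -> int:
    # Current formula: rescale the full v1 range [2, 262144] to [1, 10000].
    return 1 + (shares - 2) * 9999 // 262142

def weight_systemd(shares: int) -> int:
    # systemd's formula: anchor the v1 default (1024 shares) at weight 100.
    return 1 + (shares - 2) * 99 // (1024 - 2)

# Compare both formulas for a few representative CPU requests:
for milli_cpu, shares in [(100, 102), (500, 512), (1000, 1024)]:
    print(f"{milli_cpu}m -> shares={shares}: "
          f"current={weight_kep2254(shares)}, systemd={weight_systemd(shares)}")
```

With systemd's formula a 1-CPU container lands exactly on the default weight of `100` and a 500m container on `50`, matching the stated goals, whereas the current formula yields `39` and `20` respectively.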

### User Stories (Optional)

<!--
@@ -219,8 +333,14 @@ bogged down.
-->

#### Story 1
+As a Kubernetes user, when I worked with v1, a container asking for 1 CPU had the same CPU shares as a non-Kubernetes
+workload. After moving to cgroup v2 I expect this behavior to stay the same, but instead,
+the CPU priority of Kubernetes workloads is much lower relative to non-Kubernetes workloads than it was on v1.

#### Story 2
+As a Kubernetes user, I want to be able to configure sub-cgroups inside a container to further
+distribute resources within the container. While I could do that nicely with v1, with v2 the granularity is not fine enough.
+As an example, `100m` CPU on v2 results in a CPU weight of `4`, while on v1 it would be `102` shares.

### Notes/Constraints/Caveats (Optional)

@@ -231,6 +351,11 @@ Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->

+- A significant amount of the work will need to land in other layers, mainly the OCI runtimes and the CRI.
+- We'll probably need a CRI configuration option to ensure coordination between the CRI and the OCI runtime implementations,
+  and to ensure the change lands at the same version, as suggested
+  [here](https://github.com/kubernetes/kubernetes/issues/131216#issuecomment-2806656165).

### Risks and Mitigations

<!--
@@ -245,6 +370,13 @@ How will UX be reviewed, and by whom?
Consider including folks who also work outside the SIG or subproject.
-->

+The main risk comes from the fact that we have been using the same formula for quite some time, hence there's always
+a risk that a user relies on the exact values we produce.
+
+That being said, the formula is entirely an implementation detail that users most probably do not
+rely on for concrete values. In any case, we should ensure that the new formula is well documented
+and that the change is properly communicated to users.

## Design Details

<!--

keps/sig-node/5246-cgroup-cpu-share-to-weight-conversion/kep.yaml

+14 -18
@@ -1,41 +1,37 @@
-title: KEP Template
-kep-number: NNNN
+title: Migrate to systemd's cgroup v1 CPU shares to v2 CPU weight formula
+kep-number: 5246
authors:
-  - "@jane.doe"
-owning-sig: sig-xyz
+  - "@iholder101"
+owning-sig: sig-node
participating-sigs:
-  - sig-aaa
-  - sig-bbb
-status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced
-creation-date: yyyy-mm-dd
+  - sig-node
+status: implementable
+creation-date: 2025-04-16
reviewers:
  - TBD
-  - "@alice.doe"
approvers:
  - TBD
-  - "@oscar.doe"

see-also:
-  - "/keps/sig-aaa/1234-we-heard-you-like-keps"
-  - "/keps/sig-bbb/2345-everyone-gets-a-kep"
+  - "/keps/sig-node/2254-cgroup-v2"
replaces:
-  - "/keps/sig-ccc/3456-replaced-kep"
+  - "/keps/sig-node/2254-cgroup-v2"

# The target maturity stage in the current dev cycle for this KEP.
# If the purpose of this KEP is to deprecate a user-visible feature
# and a Deprecated feature gates are added, they should be deprecated|disabled|removed.
-stage: alpha|beta|stable
+stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
-latest-milestone: "v1.19"
+latest-milestone: "v1.34"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
-  alpha: "v1.19"
-  beta: "v1.20"
-  stable: "v1.22"
+  alpha: TBD
+  beta: TBD
+  stable: TBD

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
