Kubernetes was originally implemented with cgroup v1 in mind.
In cgroup v1, the CPU shares were derived very simply from the container's CPU request, expressed in milliCPU form.
As an example, a container requesting 1 CPU (i.e. `1000m`) gets `cpu.shares = 1024`, since the milliCPU request is scaled by `1024 / 1000`.
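For illustration, a minimal Go sketch of this v1 scaling (the constant and function names here are illustrative, not the kubelet's actual identifiers):

```go
const (
	sharesPerCPU  = 1024 // cpu.shares granted per full CPU
	milliCPUToCPU = 1000 // milliCPU units in one CPU
	minShares     = 2    // kernel-enforced minimum for cpu.shares
)

// milliCPUToShares converts a CPU request in milliCPU to a cgroup v1
// cpu.shares value, never going below the kernel minimum.
func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := (milliCPU * sharesPerCPU) / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	return shares
}
```

With this scaling, `1000m` yields `1024` shares and `100m` yields `102` shares, which matches the values shown in the examples below.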
Later, when cgroup v2 support became necessary and the focus shifted to it, a [dedicated KEP-2254](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2) was submitted.
Cgroup v1 and v2 have very different ranges of values for CPU shares and weight.
Cgroup v1 uses a range of `[2^1 - 2^18] == [2 - 262144]` for CPU shares.
Cgroup v2 uses a range of `[10^0 - 10^4] == [1 - 10000]` for CPU weight.
As part of this KEP, it was agreed to use the following formula to perform the conversion from cgroup v1's cpu.shares to cgroup v2's CPU weight, as can be seen [here](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2#phase-1-convert-from-cgroups-v1-settings-to-v2):
`cpu.weight = (1 + ((cpu.shares - 2) * 9999) / 262142) // convert from [2-262144] to [1-10000]`
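In code, this is a straightforward linear re-mapping between the two ranges; a minimal Go sketch (the function name is illustrative):

```go
// cpuSharesToWeight converts a cgroup v1 cpu.shares value in [2, 262144]
// to a cgroup v2 cpu.weight value in [1, 10000] using the KEP-2254 formula.
// Integer division truncates, matching the values observed below.
func cpuSharesToWeight(shares uint64) uint64 {
	return 1 + ((shares-2)*9999)/262142
}
```

For example, `cpuSharesToWeight(1024)` returns `39` and `cpuSharesToWeight(102)` returns `4`, the values that show up in the examples below.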
### Examples for the current state
Let's start with an example to understand what the cgroup configuration looks like in both environments.
I'll use the following dummy pod and run it on v1 and v2 setups:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dummy-sleeping-pod
spec:
  containers:
  - name: sleep-container
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: 1
```
On cgroup v1 the underlying configuration is pretty intuitive:
```shell
> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu/cpu.shares"
1024
```
On v2, the configuration looks like the following:
```shell
> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu.weight"
39
```
And indeed, according to the formula above, `cpu.weight = (1 + ((1024 - 2) * 9999) / 262142) ~= 39.98`, which the integer arithmetic truncates to `39`.
If I changed the pod to request only `100m` CPU, the configuration would look like the following:
on v1:
```shell
> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu/cpu.shares"
102
```
on v2:
```shell
> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu.weight"
4
```
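And indeed, `100m` maps to `102` CPU shares (`100 * 1024 / 1000 = 102` with integer truncation), and the formula above gives `cpu.weight = (1 + ((102 - 2) * 9999) / 262142) ~= 4.8`, which is truncated to `4`.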
## Motivation
The above formula focuses on converting values from one range to the other while keeping each value at the same relative position within the range. As an example, if a certain cpu.shares value sits at 20% of the v1 range, it will stay at 20% of the range when converted to cgroup v2.
However, this introduces several problems.
### A non-Kubernetes workload has a much higher priority in v2
The default CPU shares value in cgroup v1 is `1024`.
This means that when Kubernetes workloads compete with non-Kubernetes workloads (system daemons, drivers, the kubelet itself, etc.), a container requesting 1 CPU has the same CPU priority as a "regular" process. Asking for less than 1 CPU grants a lower priority, and vice versa.
However, in cgroup v2, the default CPU weight is `100`.
263
+
This means that (as can be seen above) a container asking for 1 CPU now has less than 40% of the default CPU weight.
The implication is that under v2, Kubernetes workloads have a much lower CPU priority relative to non-Kubernetes workloads.
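As a rough illustration, assume the container's cgroup competes directly with a single process at the default setting: on v1 it would receive `1024 / (1024 + 1024) = 50%` of the contended CPU time, while on v2 it would receive only `39 / (39 + 100) ~= 28%`.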
### A too-small granularity
As can be seen above, a container that requests `100m` CPU has a CPU weight of only `4`, while on v1 it would have `102` CPU shares.
This value is not granular enough.
This is relevant for use-cases in which sub-cgroups need to be configured inside a container to further distribute its resources.
As an example, there could be a container running a few CPU intensive processes and one managerial process that does not need to consume a lot of CPU, but needs to be very responsive. In such a case, sub-cgroups can be created inside the container, leaving 90% of the weight to the CPU-bound processes and 10% to the other process.
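As a rough sketch of how such a split could be wired up from inside the container (the sub-cgroup names, the 90/10 values, and the assumption of a writable, delegated cgroup v2 mount are illustrative, not something this KEP prescribes):

```go
package main

import (
	"os"
	"path/filepath"
)

// setupSubCgroup creates a sub-cgroup under the container's cgroup root and
// assigns it the given cpu.weight value.
func setupSubCgroup(name, weight string) error {
	dir := filepath.Join("/sys/fs/cgroup", name)
	if err := os.Mkdir(dir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "cpu.weight"), []byte(weight), 0o644)
}

func main() {
	// 90% of the container's weight for the CPU-bound work, 10% for the manager.
	// The processes still need to be moved into these leaves via cgroup.procs,
	// and "+cpu" enabled in the parent's cgroup.subtree_control (omitted here).
	if err := setupSubCgroup("workers", "90"); err != nil {
		panic(err)
	}
	if err := setupSubCgroup("manager", "10"); err != nil {
		panic(err)
	}
}
```

With the current conversion, a `100m` container only has a total weight of `4` to split between such sub-cgroups, which is why finer granularity matters here.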
### Goals
<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->
- Just like in cgroup v1, when a container asks for 1 CPU it should get the default amount of CPU weight, which is `100`.
  In the same way, asking for `500m` CPU should result in a CPU weight of `50`, and so on.
  This aligns the v1 and v2 behaviors.
- Track that the different layers (OCI, CRI, Kubelet, etc.) are aligned with the new formula.
### Non-Goals
<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->
- Introduce new APIs to configure cgroups.
- Change CPU priorities between Kubernetes workloads.
## Proposal
As [suggested](https://github.com/kubernetes/kubernetes/issues/131216#issuecomment-2806442083) by the [original KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2) author, we should migrate to systemd's formula for converting CPU shares to CPU weight.
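For context, systemd's conversion (to the best of my understanding of its `CPUShares=` to `CPUWeight=` translation) scales the value so that the v1 default of `1024` shares maps to the v2 default of `100` weight, clamped to the valid range. A minimal Go sketch of that behavior (the function name is illustrative):

```go
// systemdSharesToWeight mirrors systemd's proportional conversion: scale so
// that 1024 shares (the v1 default) maps to 100 weight (the v2 default),
// then clamp to the valid cpu.weight range [1, 10000].
func systemdSharesToWeight(shares uint64) uint64 {
	weight := shares * 100 / 1024
	if weight < 1 {
		return 1
	}
	if weight > 10000 {
		return 10000
	}
	return weight
}
```

Under this formula a container requesting 1 CPU (`1024` shares) gets a weight of `100` and `500m` (`512` shares) gets `50`, which matches the goals above.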