KEP-1287: Allow memory limit decreases #5368
@@ -390,9 +390,6 @@ WindowsPodSandboxConfig.
 configurations will see values that may not represent actual configurations. As a
 mitigation, this change needs to be documented and highlighted in the
 release notes, and in top-level Kubernetes documents.
-1. Resizing memory lower: Lowering cgroup memory limits may not work as pages
-   could be in use, and approaches such as setting limit near current usage may
-   be required. This issue needs further investigation.
 1. Scheduler race condition: If a resize happens concurrently with the scheduler evaluating the node
    where the pod is resized, it can result in a node being over-scheduled, which will cause the pod
    to be rejected with an `OutOfCPU` or `OutOfMemory` error. Solving this race condition is out of
@@ -810,11 +807,17 @@ Setting the memory limit below current memory usage can cause problems. If the kernel cannot reclaim
 sufficient memory, the outcome depends on the cgroups version. With cgroups v1 the change will
 simply be rejected by the kernel, whereas with cgroups v2 it will trigger an oom-kill.
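For illustration only (not part of the KEP text or this diff): a minimal Go sketch of the kernel-level behavior described above, assuming the new limit is written directly to the cgroup filesystem. The package, helper name, and error handling are simplifications, not the Kubelet's actual code path.

```go
package cgroupsketch

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"syscall"
)

// lowerMemoryLimit writes newLimit (bytes) to a cgroup memory limit file,
// e.g. memory.limit_in_bytes (cgroups v1) or memory.max (cgroups v2).
func lowerMemoryLimit(limitFile string, newLimit int64) error {
	err := os.WriteFile(limitFile, []byte(strconv.FormatInt(newLimit, 10)), 0o644)
	if errors.Is(err, syscall.EBUSY) {
		// cgroups v1: if the kernel cannot reclaim enough memory, the write is
		// rejected and the old limit stays in place.
		return fmt.Errorf("kernel rejected limit %d (usage above limit?): %w", newLimit, err)
	}
	// cgroups v2: the write generally succeeds, but the kernel may oom-kill
	// processes in the cgroup to bring usage under the new limit.
	return err
}
```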
-In the initial beta release of in-place resize, we will **disallow** `PreferNoRestart` memory limit
-decreases, enforced through API validation. The intent is for this restriction to be relaxed in the
-future, but the design of how limit decreases will be approached is still undecided.
+If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
+**best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any
+guarantees. Before decreasing container memory limits, the Kubelet will read the container memory
+usage (via the StatsProvider). If usage is greater than the desired limit, the resize will be
Review thread:

- If this happens, would it try again in the next sync loop iteration?
- I think so. @tallclair Please correct me if I'm wrong.
- I think the runtime would be better for this, or even the OCI runtime, as it closes the TOCTOU window. There could be a specified CRI error that could be bubbled up that the kubelet could respond to.
- I agree, but I'm hesitant to plumb this down into the runtime before we have a better understanding of what the long-term direction is.
- What are the alternatives besides reverting the behavior change in the kernel?
- Maybe leveraging …
- We can use …

Review thread:

- I'm guessing this would be a better fit for the CRI runtime layer than runc. @chrishenzie @samuelkarp @mikebrow can any of you chime in for containerd, and whether this approach seems viable? Would it make sense to implement this in a shared library that cri-o & containerd could both use?
- Really the OCI runtime would be best, as that's closest and can be considered a shared layer. If the OCI runtime doesn't support it, CRI could do it as a backup, using runtime features to figure it out maybe?
- Since the CRI API supports …
- Thinking about this more, I'm worried about skew between enabling memory limit decreases and getting the protection logic into the runtime (either the OCI or CRI layer). This leaves 2 paths forward: … The restriction on memory limit decrease is an overly strict limitation that constrains usage of IPPR, especially for guaranteed pods. I'm worried that the first approach (wait for runtime implementation) will delay this too long, so I propose we go with the second option. I still agree that we ultimately want to implement this in the runtime, but I think we should put the basic check in the Kubelet to unblock this feature. WDYT?
- My 2 cents: …
- @roycaihw Using …
- @tallclair 👍
+skipped for that container. The pod condition `PodResizeInProgress` will remain, with an `Error`
+reason, and a message reporting the current usage & desired limit. This is considered best-effort
+since it is still subject to a time-of-check-time-of-use (TOCTOU) race condition where the usage exceeds the limit after the
Review thread:

- Can we incorporate some timeout on this? What if we declare some mechanism to push an app to free up resources (some kind of hungry balloon process inside the Pod cgroup), or some kind of signal to the Pod? Once we have such a mechanism, we will need a delay in applying the resize. So I wonder if we should assume that a mechanism like this will exist in the future, and account for a timeout that will be introduced. It would also be easier if we introduced a new CRI call for this, something like … I guess the questions here are: … In any case, I believe that introducing some sort of timeout for the memory shrink operation will be very useful.
- Good questions, @SergeyKanzhelev. The root of the problem here is that shrinking memory can be a longer-running operation, and we need to decide whether the Kubelet will control the operation, or whether the runtime will handle it (preferably asynchronously). IMO, the long-term direction should be that the CRI resize call (… Independently of memory limit decrease, setting a timeout on the …
- The operation, even if asynchronous, must have some deadline on completion and a way to cancel/revert by the kubelet. I would think a bit more design is needed in the KEP text to explain the model we will pursue. If we are going with what is written, how will we transition to the new model (a new CRI API? changed semantics for existing APIs?)
- As for the immediate implementation: maybe the kubelet needs to query metrics a few times to account for the metrics update interval. Or are we force-querying memory?
- We don't yet know the model we will pursue, hence explicitly stating that this is best-effort and we will improve it over time. This leaves the door open for whichever approach we land on. Good point about cached metrics. I was picturing force-querying memory. There's a related effort to increase the resolution of container metrics for more responsive autoscaling.
- Why would a gradual decrease by a few MB help and not by a few KB? A decrease of a few KB could be enough to OOM-kill; it depends on the app and the situation, I believe. Overall comment: scale-down is an app-owner action, so app owners should ensure requests will not end up in an OOM by adding a "safety net" which only they know at this specific time (past usage, a forecasting model, the app source code). I can argue that a similar issue could also apply to CPU limits: if CPU is scaled down to a low number, how do we know this will not result in failing liveness probes? We don't; it depends on the app. Thus a CPU limit scale-down could also result in a pod restart, depending on the app. I respect the efforts done in this KEP, and I am in favor of bringing back memory scale-down. I think we should accept the risks, document them, and move on instead of trying to solve all corner cases in the kubelet; it is the app owner who needs to think twice when resizing below a certain limit. On the other hand, perhaps this "limitation" is useful for other users, because it allows an app owner to force an OOM-kill, if they need to, with an intentional small memory scale-down.
- Yes, the theory here is that a gradual decrease will create a memory-pressure situation that results in the kernel (and maybe the app) gradually decreasing usage, rather than the app suddenly being OOM-killed. The size of the chunks may need to depend on the page size. Absolutely, and the API here, as with memory, is: you can safely decrease to some minimum amount you have specified. So the app owner is responsible for configuring some minimum and ensuring the app works with it (for example in VPA), and k8s is responsible for safely shrinking to this minimum. The whole discussion on how to make memory decrease safe is due to the decision we made on the API surface. We declared that k8s will do its absolute best to scale down, and will never force a scale-down if k8s knows it will cause a restart. We discussed the need for a "force change" API, but haven't added it; if we need it based on user feedback, we should reopen the conversation (maybe a separate KEP). So the argument that this is the user's responsibility doesn't work here, as we are implementing the API we declared, not an API of "best-effort resize".
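To make the stepped-decrease idea from the thread above concrete, here is a hypothetical sketch (not anything the KEP specifies): lower the limit one chunk at a time, re-checking usage before each step. The chunk size, the `readUsage`/`applyLimit` helpers, and the stop-and-retry-later behavior are all assumptions.

```go
package resizesketch

import "fmt"

// graduallyDecreaseLimit lowers a memory limit from current to target in
// chunk-sized steps, checking usage before each step instead of applying one
// abrupt cut. readUsage and applyLimit are assumed helpers (e.g. backed by a
// stats provider and the runtime's resource-update call).
func graduallyDecreaseLimit(
	current, target, chunk uint64,
	readUsage func() (uint64, error),
	applyLimit func(uint64) error,
) error {
	if chunk == 0 {
		return fmt.Errorf("chunk must be positive")
	}
	for limit := current; limit > target; {
		next := limit - chunk
		if chunk > limit || next < target {
			next = target // clamp the final step (and guard against underflow)
		}
		usage, err := readUsage()
		if err != nil {
			return err
		}
		if usage > next {
			// Stop here rather than risk an oom-kill; a later sync could retry.
			return fmt.Errorf("usage %d still above next step %d", usage, next)
		}
		if err := applyLimit(next); err != nil {
			return err
		}
		limit = next
	}
	return nil
}
```

Whether such stepping would live in the Kubelet, the CRI runtime, or the OCI runtime is exactly the open question in the thread above.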
+check is performed. A similar check will also be performed at the pod level before lowering the pod
+cgroup memory limit.
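For illustration, a minimal sketch of the container-level best-effort check described in the added text above; the real Kubelet code around the StatsProvider and the `PodResizeInProgress` condition differs, and the names here are assumptions.

```go
package resizesketch

import "fmt"

// canDecreaseMemoryLimit sketches the best-effort check: read current usage
// and skip the decrease if usage already exceeds the desired limit. The
// caller would keep the PodResizeInProgress condition with an Error reason
// and surface `reason` in its message.
func canDecreaseMemoryLimit(usageBytes, desiredLimitBytes uint64) (ok bool, reason string) {
	if usageBytes > desiredLimitBytes {
		return false, fmt.Sprintf(
			"memory usage (%d bytes) exceeds desired limit (%d bytes); skipping limit decrease",
			usageBytes, desiredLimitBytes)
	}
	// Still subject to the TOCTOU race described above: usage can climb past
	// the new limit between this check and the cgroup update.
	return true, ""
}
```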
-Memory limit decreases with `RestartRequired` are still allowed.
+_Version skew note:_ Kubernetes v1.33 (and earlier) nodes only check the pod-level memory usage.
Review thread:

- Is this suggesting that pod-level memory limit decrease was already supported in 1.33 for …
- The original implementation had this support with the pod-level check. In v1.33 we prevented memory limit decreases via API validation, but didn't put anything in the node to prevent it. So, if we relax the apiserver validation in v1.34, old nodes will be able to decrease the memory limit, but without much in the way of protections.
- Would it make sense to move more complex logic elsewhere? I think it would be better to have memory resizing handled in VPA, using some memory-resizing configs (to be agreed what that means) and applying them to a workload in the form of options in the VPA Resource Policy.
- I agree that having more complex logic live in controllers such as VPA makes sense, but VPA still won't be able to lower memory limits if we don't relax the restriction in the apiserver. We need to make sure what we do here is reasonable for anyone who may be attempting to lower memory limits even outside the context of VPA or other controllers. The proposal in this KEP seems to me the simplest thing we can possibly do.
- Got it! Thank you for confirming that we will use VPA for more advanced memory-resizing use cases. It would be great to solve this limitation as part of k8s 1.34, to have memory limit resizing in VPA IPPR (targeting k8s 1.34).
- I see. 1.33 nodes will not perform the container-level memory check before reducing the memory limit, which may result in an OOM kill if the target limit is below the current usage. LGTM given the limit reduction operation is best-effort.

 ### Swap

@@ -891,7 +894,8 @@ This will be reconsidered post-beta as a future enhancement.

 ### Future Enhancements

-1. Allow memory limits to be decreased, and handle the case where limits are set below usage.
+1. Improve memory limit decrease oom-kill prevention by leveraging other kernel mechanisms or using
Review thread:

- Are the current Kubernetes minimum kernel version requirements sufficient for the kernel mechanisms you are referring to? Which kernel mechanisms do you plan to leverage?
- Still TBD. The initial proposal was to leverage … cc @roycaihw
- We will update the kernel version requirement public doc based on the kernel features we use.
- Thanks for clarifying! /lgtm
+   gradual decrease.
 1. Kubelet (or Scheduler) evicts lower priority Pods from Node to make room for
    resize. Pre-emption by Kubelet may be simpler and offer lower latencies.
 1. Allow ResizePolicy to be set on Pod level, acting as default if (some of)
@@ -1546,6 +1550,8 @@ _This section must be completed when targeting beta graduation to a release._
   and update CRI `UpdateContainerResources` contract
 - Add back `AllocatedResources` field to resolve a scheduler corner case
 - Introduce Actuated resources for actuation
+- 2025-06-03 - v1.34 post-beta updates
+  - Allow no-restart memory limit decreases

 ## Drawbacks
