
KEP-1287: Allow memory limit decreases #5368


Merged: 1 commit, merged Jun 23, 2025
22 changes: 14 additions & 8 deletions keps/sig-node/1287-in-place-update-pod-resources/README.md
@@ -390,9 +390,6 @@ WindowsPodSandboxConfig.
configurations will see values that may not represent actual configurations. As a
mitigation, this change needs to be documented and highlighted in the
release notes, and in top-level Kubernetes documents.
1. Resizing memory lower: Lowering cgroup memory limits may not work as pages
could be in use, and approaches such as setting limit near current usage may
be required. This issue needs further investigation.
1. Scheduler race condition: If a resize happens concurrently with the scheduler evaluating the node
where the pod is resized, it can result in a node being over-scheduled, which will cause the pod
to be rejected with an `OutOfCPU` or `OutOfMemory` error. Solving this race condition is out of
@@ -810,11 +807,17 @@

Setting the memory limit below current memory usage can cause problems. If the kernel cannot
reclaim sufficient memory, the outcome depends on the cgroups version. With cgroups v1 the change
will simply be rejected by the kernel, whereas with cgroups v2 it will trigger an oom-kill.
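For illustration only, the following is a minimal Go sketch of the kernel-level difference described above when a limit below current usage is written directly to the cgroup files. The paths, values, and error handling are assumptions for the example, not part of the KEP or the Kubelet:

```go
// Sketch: what the cgroups v1 vs v2 behavior described above looks like when a
// memory limit below current usage is written directly. Paths are illustrative.
package main

import (
	"fmt"
	"os"
)

func lowerLimit(cgroupFile string, newLimitBytes int64) error {
	// cgroups v1 (memory.limit_in_bytes): the write fails (typically EBUSY) if the
	// kernel cannot reclaim enough memory to get under the new limit.
	// cgroups v2 (memory.max): the write is accepted, the kernel reclaims what it
	// can, and the workload is oom-killed if usage still exceeds the new limit.
	return os.WriteFile(cgroupFile, []byte(fmt.Sprintf("%d", newLimitBytes)), 0o644)
}

func main() {
	// Hypothetical pod cgroup paths; real paths are managed by the Kubelet/runtime.
	if err := lowerLimit("/sys/fs/cgroup/memory/pod123/memory.limit_in_bytes", 64<<20); err != nil {
		fmt.Println("cgroups v1: limit decrease rejected by the kernel:", err)
	}
	if err := lowerLimit("/sys/fs/cgroup/pod123/memory.max", 64<<20); err == nil {
		fmt.Println("cgroups v2: write accepted; container may be oom-killed if usage stays above the limit")
	}
}
```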

In the initial beta release of in-place resize, we will **disallow** `PreferNoRestart` memory limit
decreases, enforced through API validation. The intent is for this restriction to be relaxed in the
future, but the design of how limit decreases will be approached is still undecided.
If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
@esotsal esotsal Jun 6, 2025

Suggested change:
- If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
+ If the memory resize restart policy is specified `NotRequired` (if not specified, it defaults to `NotRequired`), the Kubelet will make a

**best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any

Suggested change:
- **best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any
+ **best-effort** attempt to prevent oom-kills when decreasing memory limits below container used memory, but doesn't provide any

guarantees. Before decreasing container memory limits, the Kubelet will read the container memory
usage (via the StatsProvider). If usage is greater than the desired limit, the resize will be
Contributor:

If this happens, would it try again in the next sync loop iteration?

Member:

I think so. @tallclair Please correct me if I'm wrong.

Contributor:

I think the runtime would be better for this, or even the OCI runtime, as it closes the TOCTOU window. There could be a specified CRI error that could be bubbled up that the kubelet could respond to.

Member Author:

I agree, but I'm hesitant to plumb this down into the runtime before we have a better understanding of what the long-term direction is.

Contributor:

what are the alternatives besides reverting the behavior change in the kernel?

Member Author:

Maybe leveraging memory.high? @roycaihw any other mechanisms under consideration?

Member:

We can use memory.high and memory.reclaim to bring down the memory usage more safely.

  • memory.reclaim allows us to proactively reclaim memory, to bring the usage down towards our target if needed.
  • memory.high sets a soft limit without OOM killing by doing proactive memory reclaim and even throttling the process, allowing us to take further actions (bringing down memory.max if the process is stable, or reverting memory.high if the process suffers from throttling)
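For context, here is a minimal Go sketch of the sequence described in the comment above, assuming cgroup v2 with illustrative paths and error handling (not part of the KEP): proactively reclaim, then apply the target as memory.high, and only lower memory.max once usage is under control:

```go
// Sketch of the memory.reclaim / memory.high sequence described above.
// Assumes cgroup v2; the cgroup directory and target value are illustrative.
package memsketch

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func writeCgroupFile(dir, name, value string) error {
	return os.WriteFile(filepath.Join(dir, name), []byte(value), 0o644)
}

func readUsage(dir string) (int64, error) {
	b, err := os.ReadFile(filepath.Join(dir, "memory.current"))
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

// lowerMemoryLimit tries to bring usage under target before enforcing it as a hard limit.
func lowerMemoryLimit(cgroupDir string, target int64) error {
	usage, err := readUsage(cgroupDir)
	if err != nil {
		return err
	}
	if usage > target {
		// Ask the kernel to proactively reclaim the difference (best-effort).
		if err := writeCgroupFile(cgroupDir, "memory.reclaim", strconv.FormatInt(usage-target, 10)); err != nil {
			return fmt.Errorf("reclaim failed: %w", err)
		}
		// Apply the target as a soft limit first: the kernel reclaims and throttles
		// instead of oom-killing, leaving room to back off if the workload suffers.
		if err := writeCgroupFile(cgroupDir, "memory.high", strconv.FormatInt(target, 10)); err != nil {
			return err
		}
		// A real implementation would re-check usage here, and possibly revert
		// memory.high, before proceeding to the hard limit.
	}
	// Finally enforce the hard limit.
	return writeCgroupFile(cgroupDir, "memory.max", strconv.FormatInt(target, 10))
}
```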

Member Author:

I'm guessing this would be a better fit for the CRI runtime layer than runc. @chrishenzie @samuelkarp @mikebrow can any of you chime in for containerd, and whether this approach seems viable?

Would it make sense to implement this in a shared library that cri-o & containerd could both use?

Contributor:

Really, the OCI runtime would be best as that's the closest layer, and it can be considered a shared layer. If the OCI runtime doesn't support it, the CRI could do it as a backup, using runtime features to figure it out, maybe?

Member:

Since the CRI API supports UpdateContainerResources (including the memory limit) this seems like an appropriate place for this logic. I'm not sure if the OCI runtime is an appropriate place for this logic. Is it typically responsible for performing these kinds of checks before updating containers?

Member Author:

Thinking about this more, I'm worried about skew between enabling memory limit decreases and getting the protection logic into the runtime (either the OCI or CRI layer). This leaves two paths forward:

  1. Work on getting it into the runtime, and only lift the k8s restriction once there has been sufficient time for the runtime protections to roll out.
  2. Proceed with the protections implemented in the kubelet, and in parallel work on adding the runtime-layer protections. Once the runtime mechanisms have been in place long enough to roll out, we can remove the Kubelet mechanisms.

The restriction on memory limit decreases is overly strict and constrains usage of IPPR, especially for guaranteed pods. I'm worried that the first approach (wait for the runtime implementation) will delay this too long, so I propose we go with the second option. I still agree that we ultimately want to implement this in the runtime, but I think we should put the basic check in the Kubelet to unblock this feature.

WDYT?

Contributor:

my 2 cents.

> We can use memory.high and memory.reclaim to bring down the memory usage more safely.

> memory.high sets a soft limit without OOM killing

@roycaihw Using memory.reclaim is an interesting idea worth exploring IMO.
However, using memory.high seems problematic to me. In cases where not enough memory can be reclaimed, processes would not be able to operate, probably leading to a crash anyway, but one that would be much harder to reason about.

> I still agree that we ultimately want to implement this in the runtime, but I think we should put the basic check in the Kubelet to unblock this feature.

@tallclair 👍
Performing the check at the CRI/OCI level is an optimization, but +1 for not blocking this feature over it. I'm not sure the difference between the kubelet check and a CRI/OCI check is very significant in practice.

skipped for that container. The pod condition `PodResizeInProgress` will remain, with an `Error`
reason, and a message reporting the current usage & desired limit. This is considered best-effort
since it is still subject to a time-of-check-time-of-use (TOCTOU) race condition where the usage exceeds the limit after the
Member:

Can we incorporate some timeout on this? What if we declare some mechanism to push an app to free up resources (some kind of hungry balloon process inside the Pod cgroup), or some kind of signal to the Pod? Once we have this mechanism, we will need a delay in applying the resize. So I wonder if we should assume that a mechanism like this will exist in the future, and that we need to account for the timeout it will introduce.

It would also be easier if we introduce a new CRI call for this, something like PrepareForMemoryDecrease().

Member:

I guess the questions here are:

  1. Should the container runtime get a "prepare for resize" signal?
  2. Once we have a mechanism to tell an app or runtime that we need to downscale, will we be able to introduce the timeout mechanism easily?
  3. Should we investigate techniques like gradual decrease by a few Mb with the above mentioned checks so we can create a pressure for an app without creating the OOMkill in a single memory size decrease?

In any case, I believe that introducing some sort of timeout for the memory shrink operation will be very useful.

Member Author:

Good questions, @SergeyKanzhelev.

The root of the problem here is that shrinking memory can be a longer-running operation, and we need to decide whether the Kubelet will control the operation, or if the runtime will handle it (preferably asynchronously).

IMO, the long-term direction of this should be that the CRI resize call (UpdateContainerResources) is asynchronous. Currently the Kubelet assumes that if the call succeeds, then the resize is complete. The big blocker to making the call asynchronous is that the PLEG needs a signal or indicator that the resize is complete. This is covered by kubernetes/kubernetes#131112, but needs to work for the generic PLEG too.

Independently of memory limit decrease, setting a timeout on the UpdateContainerResources call seems prudent.
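As a rough illustration of the timeout idea, here is a sketch against a simplified stand-in for the CRI runtime service; the interface, signature, and timeout value are assumptions, not the actual Kubelet or CRI code:

```go
// Sketch: bounding the UpdateContainerResources call with a deadline so a slow
// memory shrink in the runtime cannot block the Kubelet sync loop indefinitely.
package resizesketch

import (
	"context"
	"time"
)

// runtimeService is a simplified stand-in for the CRI runtime client; real code
// would use the generated CRI API request/response types.
type runtimeService interface {
	UpdateContainerResources(ctx context.Context, containerID string, memoryLimitBytes int64) error
}

// updateMemoryLimitWithTimeout applies the new limit, but gives up (leaving the
// resize marked as in-progress/errored for a later retry) if the runtime does
// not finish within the deadline.
func updateMemoryLimitWithTimeout(rs runtimeService, containerID string, limit int64) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute) // assumed timeout value
	defer cancel()
	return rs.UpdateContainerResources(ctx, containerID, limit)
}
```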

Member:

The operation, even if asynchronous, must have some deadline for completion and a way for the kubelet to cancel or revert it. I would think a bit more design is needed in the KEP text to explain the model we will pursue. If we are going with what is written, how will we transition to the new model (a new CRI API? changing existing API semantics?)

Member:

As for immediate implementation - maybe kubelet needs to query metrics a few times to account for the metrics update interval. Or we are force-querying memory?

Member Author:

> I would think a bit more design is needed in the KEP text to explain the model we will pursue

We don't yet know the model we will pursue, hence explicitly stating that this is best-effort and we will improve it over time. This leaves the door open for whichever approach we land on.

> As for immediate implementation - maybe kubelet needs to query metrics a few times to account for the metrics update interval. Or we are force-querying memory?

Good point about cached metrics. I was picturing force-querying memory. There's a related effort to increase the resolution of container metrics for more responsive autoscaling.

@esotsal esotsal Jun 21, 2025

> 3. Should we investigate techniques like gradual decrease by a few Mb with the above mentioned checks so we can create a pressure for an app without creating the OOMkill in a single memory size decrease?

Why would a gradual decrease by a few MB help and not by a few KB? A decrease of a few KB could be enough to OOM-kill; it depends on the app and the situation, I believe.

Overall comment: scale-down is an app-owner action, so app owners should ensure requests will not end up in an OOM kill by adding a "safety net" that only they can determine at that specific point in time (past usage, a forecasting model, the app source code).

I can argue that a similar issue also applies to CPU limits: if CPU is scaled down to a low number, how do we know this will not result in failing liveness probes? We don't; it depends on the app. So a CPU limit scale-down could also result in a pod restart, depending on the app.

I respect the effort that has gone into this KEP, and I am in favor of bringing back memory scale-down.

I think we should accept the risks, document them, and move on instead of trying to solve all corner cases in the kubelet; it is the app owner who needs to think twice when resizing below a certain limit.

On the other hand, perhaps this "limitation" is useful for other users, because it allows an app owner to force an OOM kill, if needed, with an intentional small memory scale-down.

Member:

> 3. Should we investigate techniques like gradual decrease by a few Mb with the above mentioned checks so we can create a pressure for an app without creating the OOMkill in a single memory size decrease?

> Why would a gradual decrease by a few MB help and not by a few KB? A decrease of a few KB could be enough to OOM-kill; it depends on the app and the situation, I believe.

Yes, the theory here is that a gradual decrease will create a memory-pressure situation that will cause the kernel, and maybe the app, to gradually decrease usage rather than suddenly being OOM-killed. The chunk size may need to depend on the page size. (A rough sketch of this stepping approach follows this comment.)

> I can argue that a similar issue also applies to CPU limits: if CPU is scaled down to a low number, how do we know this will not result in failing liveness probes? We don't; it depends on the app. So a CPU limit scale-down could also result in a pod restart, depending on the app.

Absolutely, and the API here, as with memory, is: you can safely decrease to some minimum amount you have specified. So the app owner is responsible for configuring some minimum and ensuring the app works with it (for example in VPA), and k8s is responsible for safely shrinking to that minimum.

> I think we should accept the risks, document them, and move on instead of trying to solve all corner cases in the kubelet; it is the app owner who needs to think twice when resizing below a certain limit.

> On the other hand, perhaps this "limitation" is useful for other users, because it allows an app owner to force an OOM kill, if needed, with an intentional small memory scale-down.

The whole discussion on how to make memory decrease safe is due to the decision we made on the API surface. We declared that k8s will do its absolute best to scale down, and that it will never force a scale-down if k8s knows it will cause a restart. We discussed the need for a "force change" API, but haven't added it. If we need it based on user feedback, we should reopen the conversation (maybe in a separate KEP).

So the argument that this is the user's responsibility doesn't work here, as we are implementing the API we declared, not a "best-effort resize" API.
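To make the gradual-decrease idea discussed in this thread concrete, here is a rough sketch assuming cgroup v2; the step size, pause, and back-off behavior are illustrative assumptions, not something the KEP specifies:

```go
// Sketch: step memory.max down in chunks toward a target, re-checking usage
// between steps, so memory pressure builds gradually instead of forcing an
// immediate oom-kill. All values and paths are illustrative.
package stepsketch

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"time"
)

const step = 16 << 20 // assumed 16 MiB decrement per iteration

func readInt(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

// stepDownMemoryLimit lowers memory.max gradually, backing off if usage does not
// come down; retrying or surfacing an error is left to the caller.
func stepDownMemoryLimit(cgroupDir string, current, target int64) error {
	for limit := current; limit > target; {
		next := limit - step
		if next < target {
			next = target
		}
		usage, err := readInt(filepath.Join(cgroupDir, "memory.current"))
		if err != nil {
			return err
		}
		if usage >= next {
			return nil // back off: the workload has not released enough memory yet
		}
		if err := os.WriteFile(filepath.Join(cgroupDir, "memory.max"),
			[]byte(strconv.FormatInt(next, 10)), 0o644); err != nil {
			return err
		}
		limit = next
		time.Sleep(time.Second) // assumed pause to let reclaim and the app react
	}
	return nil
}
```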

check is performed. A similar check will also be performed at the pod level before lowering the pod
cgroup memory limit.
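A minimal sketch of the container-level check described above; the statsProvider interface and names here are simplified assumptions rather than the actual Kubelet wiring:

```go
// Sketch of the best-effort check: read current memory usage and skip the limit
// decrease if usage already exceeds the desired limit. The interface is a
// simplified stand-in for the Kubelet's StatsProvider.
package checksketch

import "fmt"

type statsProvider interface {
	// ContainerMemoryUsageBytes returns the container's current memory usage.
	ContainerMemoryUsageBytes(containerID string) (uint64, error)
}

// canDecreaseMemoryLimit reports whether the Kubelet should proceed with lowering
// the container memory limit. When usage exceeds the desired limit, the resize is
// skipped and the PodResizeInProgress condition would keep an Error reason.
func canDecreaseMemoryLimit(sp statsProvider, containerID string, desiredLimit uint64) (bool, error) {
	usage, err := sp.ContainerMemoryUsageBytes(containerID)
	if err != nil {
		return false, err // best-effort: a real implementation decides whether to retry next sync
	}
	if usage > desiredLimit {
		return false, fmt.Errorf("memory usage (%d) exceeds desired limit (%d); skipping resize", usage, desiredLimit)
	}
	return true, nil // still subject to the TOCTOU race noted above
}
```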

Memory limit decreases with `RestartRequired` are still allowed.
_Version skew note:_ Kubernetes v1.33 (and earlier) nodes only check the pod-level memory usage.
Member:

Is this suggesting that pod-level memory limit decrease was already supported in 1.33 for NotRequired policy?

Member Author:

The original implementation had this support with the pod-level check. In v1.33 we prevented memory limit decreases via API validation, but didn't put anything in the node to prevent it. So, if we relax the apiserver validation in v1.34, old nodes will be able to decrease the memory limit but without much in the way of protections.


Would it make sense to move more complex logic elsewhere? I think it would be better to have memory-resizing handling in VPA, using some memory-resizing configs (to be agreed what that means), and to apply them to a workload in the form of options in the VPA Resource Policy.

@natasha41575 natasha41575 (Contributor) Jun 10, 2025

I agree that having more complex logic live in controllers such as VPA makes sense, but VPA still won't be able to remove memory limits if we don't relax the restriction in the apiserver. We need to make sure what we do here is reasonable for anyone who may be attempting to remove memory limits even outside the context of VPA or other controllers. The proposal in this KEP seems to me to be the simplest thing we can possibly do.


Got it! Thank you for confirming that we will use VPA for more advanced memory-resizing use cases.

It would be great to solve this limitation as part of k8s 1.34, to enable memory limit resizing in VPA IPPR (also targeting k8s 1.34).

Member:

> The original implementation had this support with the pod-level check. In v1.33 we prevented memory limit decreases via API validation, but didn't put anything in the node to prevent it. So, if we relax the apiserver validation in v1.34, old nodes will be able to decrease the memory limit but without much in the way of protections.

I see. 1.33 nodes will not perform the container-level memory check before reducing the memory limit, which may result in an OOM kill if the target limit is below the current usage. LGTM, given that the limit reduction operation is best-effort.


### Swap

@@ -891,7 +894,8 @@ This will be reconsidered post-beta as a future enhancement.

### Future Enhancements

1. Allow memory limits to be decreased, and handle the case where limits are set below usage.
1. Improve memory limit decrease oom-kill prevention by leveraging other kernel mechanisms or using
@esotsal esotsal Jun 6, 2025

Are the current Kubernetes minimum kernel version requirements sufficient for the kernel mechanisms you are referring to? Which kernel mechanisms do you plan to leverage?

Member Author:

Still TBD. The initial proposal was to leverage memory.high, but there might be some problems when combining that with memory.max. Further exploration is needed, which is why I left this open-ended.

cc @roycaihw

Member:

We will update the kernel version requirement public doc based on the kernel features we use.


Thanks for clarifying!

/lgtm

gradual decrease.
1. Kubelet (or Scheduler) evicts lower priority Pods from Node to make room for
resize. Pre-emption by Kubelet may be simpler and offer lower latencies.
1. Allow ResizePolicy to be set on Pod level, acting as default if (some of)
@@ -1546,6 +1550,8 @@ _This section must be completed when targeting beta graduation to a release._
and update CRI `UpdateContainerResources` contract
- Add back `AllocatedResources` field to resolve a scheduler corner case
- Introduce Actuated resources for actuation
- 2025-06-03 - v1.34 post-beta updates
- Allow no-restart memory limit decreases

## Drawbacks
