
KEP-3953: Node Resource Hot Plug #3955

Open · wants to merge 19 commits into base: master

Conversation


@Karthik-K-N Karthik-K-N commented Apr 17, 2023

  • One-line PR description: Node Resource Hot Plug
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Apr 17, 2023
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 17, 2023
@Karthik-K-N Karthik-K-N mentioned this pull request Apr 17, 2023
4 tasks
@Karthik-K-N Karthik-K-N changed the title Dynamic node resize KEP-3953: Dynamic node resize Apr 17, 2023
@bart0sh (Contributor) commented Apr 17, 2023

/assign @mrunalp @SergeyKanzhelev @klueska

@kad (Member) commented Apr 28, 2023

/cc

@ffromani (Contributor)

/cc

@k8s-ci-robot k8s-ci-robot requested a review from ffromani May 18, 2023 07:29
@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 22, 2023
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 23, 2023
@fmuyassarov (Member)

/cc

@Karthik-K-N (Author)

Hi all, we have addressed the comments and concerns, and have added more details to the KEP.

Note: One pending issue that needs input is updating oom_score_adj for a container, which is currently being discussed in opencontainers/runc#4669.

Please let us know your thoughts
Thanks
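
For context on why oom_score_adj needs touching at all: for Burstable pods the kubelet derives the score from the ratio of a container's memory request to total node memory, so a capacity change from hot plug invalidates previously computed scores. The sketch below is a simplified illustration of that dependency, not the kubelet's actual code (the real logic lives in pkg/kubelet/qos); the constants and clamping here are approximations.

```go
package main

import "fmt"

// Simplified illustration only (not the kubelet's actual code; the real logic
// lives in pkg/kubelet/qos): for Burstable pods the oom_score_adj is derived
// from the ratio of the container's memory request to total node memory, so a
// node memory hot plug invalidates previously computed scores.
func oomScoreAdjForBurstable(memoryRequestBytes, nodeMemoryCapacityBytes int64) int64 {
	// The larger the share of node memory a container requests, the lower
	// (more protected) its score. Clamp to stay between the Guaranteed and
	// BestEffort extremes (the bounds here are rough approximations).
	score := 1000 - (1000*memoryRequestBytes)/nodeMemoryCapacityBytes
	if score < 3 {
		score = 3
	}
	if score > 999 {
		score = 999
	}
	return score
}

func main() {
	req := int64(4 << 30)                             // 4 GiB memory request
	fmt.Println(oomScoreAdjForBurstable(req, 16<<30)) // 16 GiB node: 750
	fmt.Println(oomScoreAdjForBurstable(req, 64<<30)) // after hot plug to 64 GiB: 938
}
```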

@towca commented Apr 25, 2025

Is Node autoscaling a consideration for this feature, or do you expect it to only be used on Nodes that aren't expected to be horizontally autoscaled?

Cluster Autoscaler uses existing Nodes as templates for how a new Node from the same group would look. How should it treat Nodes which were affected by hot plugging? Can we distinguish which Nodes were affected?

In any case, if the feature is meant to be used with Cluster Autoscaler and if it can result in Nodes from the same NodeGroup having different allocatable values, it will require making CA aware of the hotplugging. Otherwise CA will just pick a random Node from the group and assume that all new Nodes will have the same allocatable value - which will lead to bad decisions (repeatedly provisioning a Node that a pending Pod can't actually fit on, or not provisioning a Node that could actually help a pending Pod).
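
To make that failure mode concrete, here is a toy illustration (not Cluster Autoscaler code; the node names and sizes are made up): if the node CA samples as the group template happens to have been hot-plugged, scale-up decisions are made against allocatable that a freshly provisioned node will not actually have.

```go
package main

import "fmt"

// Toy illustration only (not Cluster Autoscaler code; node names and sizes are
// made up): CA derives a template from an existing node in the group and
// assumes every new node will match it. If the sampled node's allocatable was
// inflated by hot plug, scale-up decisions are made against capacity that a
// freshly provisioned node will not actually have.
type node struct {
	name              string
	allocatableMemGiB int64
}

func main() {
	group := []node{
		{"worker-a", 64}, // memory was hot-plugged from 16 GiB to 64 GiB
		{"worker-b", 16}, // still at the node group's original size
	}
	template := group[0] // CA picks some existing node as the group template

	pendingPodMemGiB := int64(40)
	if pendingPodMemGiB <= template.allocatableMemGiB {
		fmt.Printf("scale up a %s-like node, pod should fit\n", template.name)
		// ...but the new node actually comes up with 16 GiB, the pod stays
		// pending, and the same decision is repeated.
	}
}
```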

@j4ckstraw

FYI, when resources are decreased and the kubelet restarts, it will check whether it can still admit the pod, and an OutOfxxx error will be encountered; refer to https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/lifecycle/predicate.go#L213

@Karthik-K-N (Author)

FYI, when resources are decreased and the kubelet restarts, it will check whether it can still admit the pod, and an OutOfxxx error will be encountered; refer to https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/lifecycle/predicate.go#L213

Yeah, thank you for bringing up this point. This KEP primarily focuses on resource hot plug, though some attention is also given to hot unplug, such as setting the node to not ready in case of hot unplug.
This can be further refined in a separate KEP that focuses on hot unplug scenarios.
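
For readers following the predicate.go reference above, the sketch below shows only the general shape of that admission check under simplified assumptions (a made-up resources struct instead of the real resource model): after a capacity decrease plus a kubelet restart, a pod whose requests no longer fit node allocatable is rejected with an OutOf<resource>-style reason.

```go
package main

import "fmt"

// Sketch of the general shape of the admission check referenced above
// (simplified; the real logic in pkg/kubelet/lifecycle/predicate.go works on
// the full resource model). After a capacity decrease plus a kubelet restart,
// a pod whose requests no longer fit node allocatable is rejected with an
// OutOf<resource>-style reason.
type resources struct {
	milliCPU    int64
	memoryBytes int64
}

func admit(podRequests, nodeAllocatable, alreadyRequested resources) (bool, string) {
	if alreadyRequested.milliCPU+podRequests.milliCPU > nodeAllocatable.milliCPU {
		return false, "OutOfcpu"
	}
	if alreadyRequested.memoryBytes+podRequests.memoryBytes > nodeAllocatable.memoryBytes {
		return false, "OutOfmemory"
	}
	return true, ""
}

func main() {
	// Node shrank from 8 to 4 CPUs; a pod requesting 3 CPUs no longer fits
	// next to the 2 CPUs already requested by other pods.
	ok, reason := admit(
		resources{milliCPU: 3000, memoryBytes: 2 << 30},
		resources{milliCPU: 4000, memoryBytes: 16 << 30},
		resources{milliCPU: 2000, memoryBytes: 4 << 30},
	)
	fmt.Println(ok, reason) // false OutOfcpu
}
```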

@Karthik-K-N (Author)

Is Node autoscaling a consideration for this feature, or do you expect it to only be used on Nodes that aren't expected to be horizontally autoscaled?

We intend to use this feature on all clusters irrespective of their autoscaling capability.

Cluster Autoscaler uses existing Nodes as templates for how a new Node from the same group would look. How should it treat Nodes which were affected by hot plugging? Can we distinguish which Nodes were affected?

In any case, if the feature is meant to be used with Cluster Autoscaler and if it can result in Nodes from the same NodeGroup having different allocatable values, it will require making CA aware of the hotplugging. Otherwise CA will just pick a random Node from the group and assume that all new Nodes will have the same allocatable value - which will lead to bad decisions (repeatedly provisioning a Node that a pending Pod can't actually fit on, or not provisioning a Node that could actually help a pending Pod).

Can we make the autoscaler aware of hotplug capabilities? If not, do you have any recommendation/opinion on the kind of API that could make the autoscaler aware of changes in resource capacity?

- Post up-scale, any failure in the resync of resource managers may lead to incorrect or rejected allocation, which can result in underperforming or rejected workloads.
- To mitigate the risks, adequate tests should be added to avoid the scenarios where failure to resync resource managers can occur.

- Lack of coordination about change in resource availability across kubelet/runtime/plugins.
Contributor

What plugins are you referring to? Device plugins or something else? Might consider clarifying here.

Author

Oh yeah, I meant to say NRI plugins. Let me explicitly specify that.

or if it has to be terminated due to resource crunch.
* Recalculate OOM adjust score and Swap limits:
* Since the total capacity of the node has changed, values associated with the node's memory capacity must be recomputed.
* Handling unplug of reserved CPUs.
Contributor

Do we capture the corner case where node capacity increases but some of the resources go away, e.g. the new cpuset is bigger but doesn't contain the old cpuset (a dumb example: 0-3 -> 2-9)? Does the CPU manager sync return an error and the node go to not ready?

Author

For now we thought to only record an event during resync failures. We may need to check how we can identify what the assigned cpuset was before hotplug.
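
As an illustration of the corner case above, here is a minimal sketch assuming the k8s.io/utils/cpuset helpers; it only demonstrates how a resync could detect that a "bigger" cpuset does not contain the previous one, and is not the proposed implementation.

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

// Minimal sketch of the corner case above, assuming the k8s.io/utils/cpuset
// helpers: capacity grows from 4 to 8 CPUs, but the new cpuset does not
// contain the old one, so CPUs the CPU manager may have exclusively assigned
// (0-1 here) have disappeared. A resync would need to detect exactly this
// before trusting a "capacity only went up" assumption.
func main() {
	oldSet, _ := cpuset.Parse("0-3")
	newSet, _ := cpuset.Parse("2-9")

	if !oldSet.IsSubsetOf(newSet) {
		missing := oldSet.Difference(newSet)
		fmt.Printf("capacity grew, but previously available CPUs %s are gone\n", missing)
		// The KEP currently proposes only recording an event on resync failure;
		// identifying pre-hotplug assignments would need the CPU manager's
		// checkpointed state.
	}
}
```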

@esotsal commented May 9, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from esotsal May 9, 2025 18:58

## Glossary

Hotplug: Dynamically add compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (online offlined resources) or via hardware (physical additions while the system is running)

Suggested change
Hotplug: Dynamically add compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (online offlined resources) or via hardware (physical additions while the system is running)
Node Compute Resource Hot Plug: Dynamically add node compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (online offlined resources) or via hardware (physical additions while the system is running)

Author

We are inclined towards keeping hotplug in the glossary.


Hotplug: Dynamically add compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (online offlined resources) or via hardware (physical additions while the system is running)

Hotunplug: Dynamically remove compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (make resources go offline) or via hardware (physical removal while the system is running)

Suggested change
Hotunplug: Dynamically remove compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (make resources go offline) or via hardware (physical removal while the system is running)
Node Compute Resource Hot Unplug: Dynamically remove compute resources (CPU, Memory, Swap Capacity and HugePages) from the node, either via software (make resources go offline) or via hardware (physical removal while the system is running)

- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Handling hotplug events](#handling-hotplug-events)
- [Flow Control for updating swap limit for containers](#flow-control-for-updating-swap-limit-for-containers)

Since Swap is mentioned in this document and Swap is supported only with cgroup v2, does this mean the intention is for this KEP not to be supported on cgroup v1?

Contributor

Technically speaking cgroup v1 is feature-frozen, but I think we'll accidentally have support for cgroup v1 unless we explicitly skip the detection on v1.
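
For reference, a minimal sketch of the kind of detection being discussed: on a cgroup v2 (unified hierarchy) host the file /sys/fs/cgroup/cgroup.controllers exists, and per-container swap limits are applied via memory.swap.max, which is why the swap-related parts of the KEP are effectively v2-only. This is an illustrative check, not the kubelet's actual detection code.

```go
package main

import (
	"fmt"
	"os"
)

// Illustrative check only (not the kubelet's actual detection code): on a
// cgroup v2 (unified hierarchy) host the file /sys/fs/cgroup/cgroup.controllers
// exists, and per-container swap limits are applied via memory.swap.max.
// On a cgroup v1 host that file is absent, so the swap-related parts of this
// KEP would effectively be no-ops unless detection is skipped explicitly.
func isCgroupV2() bool {
	_, err := os.Stat("/sys/fs/cgroup/cgroup.controllers")
	return err == nil
}

func main() {
	fmt.Println("cgroup v2 unified hierarchy:", isCgroupV2())
}
```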


* Achieve seamless node capacity expansion through hot plugging resources.
* Enable the re-initialization of resource managers (CPU manager, memory manager) and kube runtime manager to accommodate alterations in the node's resource allocation.
* Recalculating and updating the OOMScoreAdj and swap memory limit for existing pods.

Do we need to add something about huge pages?

Any implications for Nodes using DPDK/SRIOV in relation to CNIs that we need to consider?

@Karthik-K-N (Author)

Is Node autoscaling a consideration for this feature, or do you expect it to only be used on Nodes that aren't expected to be horizontally autoscaled?

Cluster Autoscaler uses existing Nodes as templates for how a new Node from the same group would look. How should it treat Nodes which were affected by hot plugging? Can we distinguish which Nodes were affected?

In any case, if the feature is meant to be used with Cluster Autoscaler and if it can result in Nodes from the same NodeGroup having different allocatable values, it will require making CA aware of the hotplugging. Otherwise CA will just pick a random Node from the group and assume that all new Nodes will have the same allocatable value - which will lead to bad decisions (repeatedly provisioning a Node that a pending Pod can't actually fit on, or not provisioning a Node that could actually help a pending Pod).

Based on the conversation over Slack and here, I added a section about compatibility with Cluster Autoscaler: https://github.com/Karthik-K-N/enhancements/tree/node-resize/keps/sig-node/3953-node-resource-hot-plug#compatability-with-cluster-autoscaler.
Please take a look. Thanks.

@towca commented May 29, 2025

Based on the conversation over Slack and here, I added a section about compatibility with Cluster Autoscaler: https://github.com/Karthik-K-N/enhancements/tree/node-resize/keps/sig-node/3953-node-resource-hot-plug#compatability-with-cluster-autoscaler.
Please take a look. Thanks.

Thank you, the added section accurately captures the problem and possible solutions. Could you add some information on when you want to address this part? When going to beta with the feature?

@sanposhiho (Member) left a comment

Revisiting here: as far as I remember from a previous discussion somewhere, no change is required in the scheduler; is that still true?
If not, or if you're not sure, please make sure to add someone from sig-scheduling to review it.

@Karthik-K-N (Author)

Revisiting here: as far as I remember from a previous discussion somewhere, no change is required in the scheduler; is that still true? If not, or if you're not sure, please make sure to add someone from sig-scheduling to review it.

Yes, it still stands the same: no changes are required from the scheduler. As you mentioned in a previous review, I have updated the KEP design details to include the phrase

    Scheduler will automatically schedule any pending pods.
    This is done as an expected behavior and does not require any changes in the existing design of the scheduler, as the scheduler watches the available capacity of the node and schedules pods accordingly.

For more reference, see our previous discussion: Slack, Enhancement.

@sanposhiho (Member)

Sure, as long as that part isn't changed, I believe it's OK from the sig-scheduling side.

@Karthik-K-N (Author)

Sure, as long as that part isn't changed, I believe it's OK from the sig-scheduling side.

Yeah, I will reach out if there are any changes in the future. Thank you for taking a look.

Co-authored-by: kishen-v <[email protected]>
@Karthik-K-N (Author)

Based on the conversation over Slack and here, I added a section about compatibility with Cluster Autoscaler: https://github.com/Karthik-K-N/enhancements/tree/node-resize/keps/sig-node/3953-node-resource-hot-plug#compatability-with-cluster-autoscaler.
Please take a look. Thanks.

Thank you, the added section accurately captures the problem and possible solutions. Could you add some information on when you want to address this part? When going to beta with the feature?

Thanks for the review. We plan to provide compatibility with the autoscaler by storing the initial node resources in the node object, and we plan to deliver this during beta graduation, so we can get started now. But we are open to community feedback. Thank you.

@towca commented May 30, 2025

Thanks for the review. We plan to provide compatibility with the autoscaler by storing the initial node resources in the node object, and we plan to deliver this during beta graduation, so we can get started now. But we are open to community feedback. Thank you.

Sounds good to me, thanks!

@elmiko (Contributor) commented May 30, 2025

Gave the KEP a read; no comments or suggestions currently, but happy to see this work progressing.
