
KEP-3953: Node Resource Hot Plug #3955

Open · wants to merge 19 commits into base: master

Conversation


@Karthik-K-N Karthik-K-N commented Apr 17, 2023

  • One-line PR description: Node Resource Hot Plug
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Apr 17, 2023
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 17, 2023
@Karthik-K-N Karthik-K-N mentioned this pull request Apr 17, 2023
4 tasks
@Karthik-K-N Karthik-K-N changed the title Dynamic node resize KEP-3953: Dynamic node resize Apr 17, 2023
@bart0sh (Contributor) commented Apr 17, 2023

/assign @mrunalp @SergeyKanzhelev @klueska

@kad (Member) commented Apr 28, 2023

/cc

@ffromani (Contributor)

/cc

@k8s-ci-robot k8s-ci-robot requested a review from ffromani May 18, 2023 07:29
@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 22, 2023
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 23, 2023
@fmuyassarov (Member)

/cc

@Karthik-K-N (Author)

Hi all, we have addressed the comments and concerns, and have added more details to the KEP.

Note: One pending issue that needs input is updating oom_score_adj for a container, which is currently being discussed in opencontainers/runc#4669.

Please let us know your thoughts
Thanks
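
For context on why oom_score_adj needs touching at all: for Burstable pods the kubelet derives the score from the ratio of a container's memory request to total node memory, so a capacity change from hot plug invalidates previously computed scores. The sketch below is a simplified illustration of that dependency, not the kubelet's actual code (the real logic lives in pkg/kubelet/qos); the constants and clamping here are approximations.

```go
package main

import "fmt"

// Simplified illustration only (not the kubelet's actual code; the real logic
// lives in pkg/kubelet/qos): for Burstable pods the oom_score_adj is derived
// from the ratio of the container's memory request to total node memory, so a
// node memory hot plug invalidates previously computed scores.
func oomScoreAdjForBurstable(memoryRequestBytes, nodeMemoryCapacityBytes int64) int64 {
	// The larger the share of node memory a container requests, the lower
	// (more protected) its score. Clamp to stay between the Guaranteed and
	// BestEffort extremes (the bounds here are rough approximations).
	score := 1000 - (1000*memoryRequestBytes)/nodeMemoryCapacityBytes
	if score < 3 {
		score = 3
	}
	if score > 999 {
		score = 999
	}
	return score
}

func main() {
	req := int64(4 << 30)                             // 4 GiB memory request
	fmt.Println(oomScoreAdjForBurstable(req, 16<<30)) // 16 GiB node: 750
	fmt.Println(oomScoreAdjForBurstable(req, 64<<30)) // after hot plug to 64 GiB: 938
}
```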

@towca commented Apr 25, 2025

Is Node autoscaling a consideration for this feature, or do you expect it to only be used on Nodes that aren't expected to be horizontally autoscaled?

Cluster Autoscaler uses existing Nodes as templates for how a new Node from the same group would look. How should it treat Nodes which were affected by hot plugging? Can we distinguish which Nodes were affected?

In any case, if the feature is meant to be used with Cluster Autoscaler and if it can result in Nodes from the same NodeGroup having different allocatable values, it will require making CA aware of the hotplugging. Otherwise CA will just pick a random Node from the group and assume that all new Nodes will have the same allocatable value - which will lead to bad decisions (repeatedly provisioning a Node that a pending Pod can't actually fit on, or not provisioning a Node that could actually help a pending Pod).
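
To make that failure mode concrete, here is a toy illustration (not Cluster Autoscaler code; the node names and sizes are made up): if the node CA samples as the group template happens to have been hot-plugged, scale-up decisions are made against allocatable that a freshly provisioned node will not actually have.

```go
package main

import "fmt"

// Toy illustration only (not Cluster Autoscaler code; node names and sizes are
// made up): CA derives a template from an existing node in the group and
// assumes every new node will match it. If the sampled node's allocatable was
// inflated by hot plug, scale-up decisions are made against capacity that a
// freshly provisioned node will not actually have.
type node struct {
	name              string
	allocatableMemGiB int64
}

func main() {
	group := []node{
		{"worker-a", 64}, // memory was hot-plugged from 16 GiB to 64 GiB
		{"worker-b", 16}, // still at the node group's original size
	}
	template := group[0] // CA picks some existing node as the group template

	pendingPodMemGiB := int64(40)
	if pendingPodMemGiB <= template.allocatableMemGiB {
		fmt.Printf("scale up a %s-like node, pod should fit\n", template.name)
		// ...but the new node actually comes up with 16 GiB, the pod stays
		// pending, and the same decision is repeated.
	}
}
```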

@j4ckstraw

FYI, when resources are decreased and the kubelet restarts, it will check whether it can still admit the pod, and an OutOfxxx error will be encountered; refer to https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/lifecycle/predicate.go#L213

@Karthik-K-N (Author)

FYI, when resources are decreased and the kubelet restarts, it will check whether it can still admit the pod, and an OutOfxxx error will be encountered; refer to https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/lifecycle/predicate.go#L213

Yeah, thank you for bringing up this point. This KEP primarily focuses on resource hot plug, though some attention is also given to hot unplug, such as setting the node to not ready in case of hot unplug.
This can be further refined in a separate KEP that focuses on hot unplug scenarios.
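
For readers following the predicate.go reference above, the sketch below shows only the general shape of that admission check under simplified assumptions (a made-up resources struct instead of the real resource model): after a capacity decrease plus a kubelet restart, a pod whose requests no longer fit node allocatable is rejected with an OutOf<resource>-style reason.

```go
package main

import "fmt"

// Sketch of the general shape of the admission check referenced above
// (simplified; the real logic in pkg/kubelet/lifecycle/predicate.go works on
// the full resource model). After a capacity decrease plus a kubelet restart,
// a pod whose requests no longer fit node allocatable is rejected with an
// OutOf<resource>-style reason.
type resources struct {
	milliCPU    int64
	memoryBytes int64
}

func admit(podRequests, nodeAllocatable, alreadyRequested resources) (bool, string) {
	if alreadyRequested.milliCPU+podRequests.milliCPU > nodeAllocatable.milliCPU {
		return false, "OutOfcpu"
	}
	if alreadyRequested.memoryBytes+podRequests.memoryBytes > nodeAllocatable.memoryBytes {
		return false, "OutOfmemory"
	}
	return true, ""
}

func main() {
	// Node shrank from 8 to 4 CPUs; a pod requesting 3 CPUs no longer fits
	// next to the 2 CPUs already requested by other pods.
	ok, reason := admit(
		resources{milliCPU: 3000, memoryBytes: 2 << 30},
		resources{milliCPU: 4000, memoryBytes: 16 << 30},
		resources{milliCPU: 2000, memoryBytes: 4 << 30},
	)
	fmt.Println(ok, reason) // false OutOfcpu
}
```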

@Karthik-K-N (Author)

Is Node autoscaling a consideration for this feature, or do you expect it to only be used on Nodes that aren't expected to be horizontally autoscaled?

We intend to use this feature on all clusters irrespective of their autoscaling capability.

Cluster Autoscaler uses existing Nodes as templates for how a new Node from the same group would look. How should it treat Nodes which were affected by hot plugging? Can we distinguish which Nodes were affected?

In any case, if the feature is meant to be used with Cluster Autoscaler and if it can result in Nodes from the same NodeGroup having different allocatable values, it will require making CA aware of the hotplugging. Otherwise CA will just pick a random Node from the group and assume that all new Nodes will have the same allocatable value - which will lead to bad decisions (repeatedly provisioning a Node that a pending Pod can't actually fit on, or not provisioning a Node that could actually help a pending Pod).

Can we make the autoscaler aware of hotplug capabilities? If not, do you have any recommendation/opinion on the kind of API that could make the autoscaler aware of changes in resource capacity?

- Post up-scale, any failure in the resync of resource managers may lead to incorrect or rejected allocation, which can result in underperforming or rejected workloads.
- To mitigate the risks, adequate tests should be added to avoid the scenarios where failure to resync resource managers can occur.

- Lack of coordination about change in resource availability across kubelet/runtime/plugins.
Contributor

What plugins are you referring to? Device plugins or something else? Might consider clarifying here.

Author

Oh yeah, I meant to say NRI plugins. Let me explicitly specify that.

or if it has to be terminated due to resource crunch.
* Recalculate OOM adjust score and Swap limits:
* Since the total capacity of the node has changed, values associated with the node's memory capacity must be recomputed.
* Handling unplug of reserved CPUs.
Contributor

Do we capture the corner case where node capacity increases but some of the resources go away, e.g. the new cpuset is bigger but doesn't contain the old cpuset (a dumb example: 0-3 -> 2-9)? Does the CPU manager sync return an error and the node go to not ready?

Author

For now we thought to only record an event during resync failures. We may need to check how we can identify what the assigned cpuset was before hotplug.
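
As an illustration of the corner case above, here is a minimal sketch assuming the k8s.io/utils/cpuset helpers; it only demonstrates how a resync could detect that a "bigger" cpuset does not contain the previous one, and is not the proposed implementation.

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

// Minimal sketch of the corner case above, assuming the k8s.io/utils/cpuset
// helpers: capacity grows from 4 to 8 CPUs, but the new cpuset does not
// contain the old one, so CPUs the CPU manager may have exclusively assigned
// (0-1 here) have disappeared. A resync would need to detect exactly this
// before trusting a "capacity only went up" assumption.
func main() {
	oldSet, _ := cpuset.Parse("0-3")
	newSet, _ := cpuset.Parse("2-9")

	if !oldSet.IsSubsetOf(newSet) {
		missing := oldSet.Difference(newSet)
		fmt.Printf("capacity grew, but previously available CPUs %s are gone\n", missing)
		// The KEP currently proposes only recording an event on resync failure;
		// identifying pre-hotplug assignments would need the CPU manager's
		// checkpointed state.
	}
}
```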

@esotsal commented May 9, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from esotsal May 9, 2025 18:58

## Glossary

Hotplug: Dynamically add compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (online offlined resources) or via hardware (physical additions while the system is running)

Suggested change
Hotplug: Dynamically add compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (online offlined resources) or via hardware (physical additions while the system is running)
Node Compute Resource Hot Plug: Dynamically add node compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (online offlined resources) or via hardware (physical additions while the system is running)

Author

We are inclined towards keeping hotplug in the glossary.


Hotplug: Dynamically add compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (online offlined resources) or via hardware (physical additions while the system is running)

Hotunplug: Dynamically remove compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (make resources go offline) or via hardware (physical removal while the system is running)

Suggested change
Hotunplug: Dynamically remove compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (make resources go offline) or via hardware (physical removal while the system is running)
Node Compute Resource Hot Unplug: Dynamically remove compute resources (CPU, Memory, Swap Capacity and HugePages) from the node, either via software (make resources go offline) or via hardware (physical removal while the system is running)

- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Handling hotplug events](#handling-hotplug-events)
- [Flow Control for updating swap limit for containers](#flow-control-for-updating-swap-limit-for-containers)

Since Swap is mentioned in this document and Swap is supported only with cgroup v2, does this mean the intention is for this KEP not to be supported on cgroup v1?

Contributor

Technically speaking cgroup v1 is feature-frozen, but I think we'll accidentally have support for cgroup v1 unless we explicitly skip the detection on v1.
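
For reference, a minimal sketch of the kind of detection being discussed: on a cgroup v2 (unified hierarchy) host the file /sys/fs/cgroup/cgroup.controllers exists, and per-container swap limits are applied via memory.swap.max, which is why the swap-related parts of the KEP are effectively v2-only. This is an illustrative check, not the kubelet's actual detection code.

```go
package main

import (
	"fmt"
	"os"
)

// Illustrative check only (not the kubelet's actual detection code): on a
// cgroup v2 (unified hierarchy) host the file /sys/fs/cgroup/cgroup.controllers
// exists, and per-container swap limits are applied via memory.swap.max.
// On a cgroup v1 host that file is absent, so the swap-related parts of this
// KEP would effectively be no-ops unless detection is skipped explicitly.
func isCgroupV2() bool {
	_, err := os.Stat("/sys/fs/cgroup/cgroup.controllers")
	return err == nil
}

func main() {
	fmt.Println("cgroup v2 unified hierarchy:", isCgroupV2())
}
```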


* Achieve seamless node capacity expansion through hot plugging resources.
* Enable the re-initialization of resource managers (CPU manager, memory manager) and kube runtime manager to accommodate alterations in the node's resource allocation.
* Recalculating and updating the OOMScoreAdj and swap memory limit for existing pods.

Do we need to add something about huge pages?

Any implications for Nodes using DPDK/SRIOV in relation to CNIs that we need to consider?

@Karthik-K-N (Author)

Is Node autoscaling a consideration for this feature, or do you expect it to only be used on Nodes that aren't expected to be horizontally autoscaled?

Cluster Autoscaler uses existing Nodes as templates for how a new Node from the same group would look. How should it treat Nodes which were affected by hot plugging? Can we distinguish which Nodes were affected?

In any case, if the feature is meant to be used with Cluster Autoscaler and if it can result in Nodes from the same NodeGroup having different allocatable values, it will require making CA aware of the hotplugging. Otherwise CA will just pick a random Node from the group and assume that all new Nodes will have the same allocatable value - which will lead to bad decisions (repeatedly provisioning a Node that a pending Pod can't actually fit on, or not provisioning a Node that could actually help a pending Pod).

Based on the conversation over Slack and here, I added a section about compatibility with Cluster Autoscaler: https://github.com/Karthik-K-N/enhancements/tree/node-resize/keps/sig-node/3953-node-resource-hot-plug#compatability-with-cluster-autoscaler.
Please take a look. Thanks.

@towca commented May 29, 2025

Based on the conversation over Slack and here, I added a section about compatibility with Cluster Autoscaler: https://github.com/Karthik-K-N/enhancements/tree/node-resize/keps/sig-node/3953-node-resource-hot-plug#compatability-with-cluster-autoscaler.
Please take a look. Thanks.

Thank you, the added section accurately captures the problem and possible solutions. Could you add some information on when you want to address this part? When going to beta with the feature?

@sanposhiho (Member) left a comment

Revisiting here: as far as I remember from a previous discussion somewhere, no change is required in the scheduler; is that still true?
If not, or if you're not sure, please make sure to add someone from sig-scheduling to review it.

@Karthik-K-N (Author)

Revisiting here: as far as I remember from a previous discussion somewhere, no change is required in the scheduler; is that still true? If not, or if you're not sure, please make sure to add someone from sig-scheduling to review it.

Yes, it still stands the same: no changes are required from the scheduler. As you mentioned in a previous review, I have updated the KEP design details to include the phrase

    Scheduler will automatically schedule any pending pods.
    This is done as an expected behavior and does not require any changes in the existing design of the scheduler, as the scheduler watches the available capacity of the node and schedules pods accordingly.

For more reference, see our previous discussion: Slack, Enhancement.

@sanposhiho (Member)

Sure, as long as that part isn't changed, I believe it's OK from the sig-scheduling side.

@Karthik-K-N (Author)

Sure, as long as that part isn't changed, I believe it's OK from the sig-scheduling side.

Yeah, I will reach out if there are any changes in the future. Thank you for taking a look.

Co-authored-by: kishen-v <[email protected]>
@Karthik-K-N (Author)

Based on the conversation over Slack and here, I added a section about compatibility with Cluster Autoscaler: https://github.com/Karthik-K-N/enhancements/tree/node-resize/keps/sig-node/3953-node-resource-hot-plug#compatability-with-cluster-autoscaler.
Please take a look. Thanks.

Thank you, the added section accurately captures the problem and possible solutions. Could you add some information on when you want to address this part? When going to beta with the feature?

Thanks for the review. We plan to provide compatibility with the autoscaler by storing the initial node resources in the node object, and we plan to deliver this during beta graduation, so we can get started now. But we are open to community feedback. Thank you.

@towca commented May 30, 2025

Thanks for the review. We plan to provide compatibility with the autoscaler by storing the initial node resources in the node object, and we plan to deliver this during beta graduation, so we can get started now. But we are open to community feedback. Thank you.

Sounds good to me, thanks!

@elmiko (Contributor) commented May 30, 2025

Gave the KEP a read; no comments or suggestions currently, but happy to see this work progressing.
