KEP-3953: Node Resource Hot Plug #3955
Conversation
a7bc843 to 03e927f
/assign @mrunalp @SergeyKanzhelev @klueska
/cc
Hi all, we have addressed the comments and concerns and added more details to the KEP. Note: there is one pending issue, regarding updating, that still needs opinions. Please let us know your thoughts.
Is Node autoscaling a consideration for this feature, or do you expect it to only be used on Nodes that aren't expected to be horizontally autoscaled? Cluster Autoscaler uses existing Nodes as templates for how a new Node from the same group would look. How should it treat Nodes which were affected by hot plugging? Can we distinguish which Nodes were affected? In any case, if the feature is meant to be used with Cluster Autoscaler and it can result in Nodes from the same NodeGroup having different allocatable values, it will require making CA aware of the hot plugging. Otherwise CA will just pick a random Node from the group and assume that all new Nodes will have the same allocatable value, which will lead to bad decisions (repeatedly provisioning a Node that a pending Pod can't actually fit on, or not provisioning a Node that could actually help a pending Pod).
FYI, when resources are decreased and the kubelet restarts, it will check whether the kubelet can still admit the pod. The
Yeah, thank you for bringing up this point. Though this KEP primarily focuses on resource hot plug, some attention is also given to hot unplug, such as marking the node as not ready in case of hot unplug.
We intend to use this feature on all clusters irrespective of their autoscaling capability.
Can we make the autoscaler aware of hotplug capabilities? If not, any recommendation/opinion on the kind of API that could make the autoscaler aware of changes in resource capacity?
- Post up-scale, any failure in the resync of resource managers may lead to incorrect or rejected allocation, which can lead to underperforming or rejected workloads.
- To mitigate the risks, adequate tests should be added to avoid the scenarios where failure to resync resource managers can occur.
- Lack of coordination about change in resource availability across kubelet/runtime/plugins.
What plugins are you referring to? Device plugins or something else? You might consider clarifying here.
Oh yeah, I meant NRI plugins. Let me explicitly specify that.
or if it has to be terminated due to resource crunch.
* Recalculate OOM adjust score and Swap limits:
  * Since the total capacity of the node has changed, values associated with the node's memory capacity must be recomputed.
* Handling unplug of reserved CPUs.
Do we capture the corner case where node capacity would increase but some of the resources go away, e.g. the new cpuset is bigger but doesn't contain the old cpuset (dumb example: `0-3` -> `2-9`)? CPU manager sync returns an error and the node goes to not ready?
So for now we thought to only record an event during resync failures. Maybe we need to check how we can identify what the assigned cpuset was before the hotplug.
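(Not part of the KEP text, just to make the corner case above concrete.) A minimal sketch of the kind of check being discussed, assuming the kubelet still knows the cpuset it had assigned before the event, e.g. from the CPU manager state; checkAssignedCPUsStillOnline is a hypothetical helper, only the k8s.io/utils/cpuset calls are real:

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

// checkAssignedCPUsStillOnline compares the cpuset assigned to containers
// before a hot plug/unplug event against the cpuset that is online afterwards
// and reports which assigned CPUs have disappeared.
func checkAssignedCPUsStillOnline(assigned, online cpuset.CPUSet) (missing cpuset.CPUSet, ok bool) {
	if assigned.IsSubsetOf(online) {
		return cpuset.New(), true
	}
	return assigned.Difference(online), false
}

func main() {
	// The example from the review comment: capacity grows from 0-3 to 2-9,
	// so the node has more CPUs overall, but CPUs 0 and 1 went away.
	assigned, _ := cpuset.Parse("0-3")
	online, _ := cpuset.Parse("2-9")

	if missing, ok := checkAssignedCPUsStillOnline(assigned, online); !ok {
		fmt.Printf("assigned CPUs %s are no longer online; record an event / fail the resync\n", missing)
	}
}
```

Whether a failed check should mark the node NotReady or only emit an event is exactly the open question above.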
/cc
## Glossary

Hotplug: Dynamically add compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (online offlined resources) or via hardware (physical additions while the system is running)
Suggested change:
- Hotplug: Dynamically add compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (online offlined resources) or via hardware (physical additions while the system is running)
+ Node Compute Resource Hot Plug: Dynamically add node compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (online offlined resources) or via hardware (physical additions while the system is running)
We are inclined towards keeping Hotplug in the glossary.
Hotplug: Dynamically add compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (online offlined resources) or via hardware (physical additions while the system is running)

Hotunplug: Dynamically remove compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (make resources go offline) or via hardware (physical removal while the system is running)
Suggested change:
- Hotunplug: Dynamically remove compute resources (CPU, Memory, Swap Capacity and HugePages) to the node, either via software (make resources go offline) or via hardware (physical removal while the system is running)
+ Node Compute Resource Hot Unplug: Dynamically remove compute resources (CPU, Memory, Swap Capacity and HugePages) from the node, either via software (make resources go offline) or via hardware (physical removal while the system is running)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Handling hotplug events](#handling-hotplug-events)
- [Flow Control for updating swap limit for containers](#flow-control-for-updating-swap-limit-for-containers)
Since Swap is mentioned in this document, and Swap is supported only with cgroup v2, does this mean the intention is for this KEP not to be supported on cgroup v1?
Technically speaking, cgroup v1 is feature-frozen, but I think we'll accidentally have support for cgroup v1 unless we explicitly skip the detection on v1.
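As background for why the "Flow Control for updating swap limit for containers" section exists at all: with the kubelet's LimitedSwap behavior (cgroup v2 only), a Burstable container's swap limit is proportional to its memory request relative to total node memory, multiplied by the node's swap capacity, so hot plugging memory or swap invalidates the limits of already-running containers. A simplified sketch of that proportion (not the kubelet's actual code):

```go
package main

import "fmt"

// limitedSwap approximates the LimitedSwap proportion: a container's share of
// node swap follows its memory request's share of total node memory.
// Simplified sketch; the real computation lives in the kubelet.
func limitedSwap(containerMemRequest, nodeMemCapacity, nodeSwapCapacity int64) int64 {
	if nodeMemCapacity == 0 {
		return 0
	}
	return int64(float64(containerMemRequest) / float64(nodeMemCapacity) * float64(nodeSwapCapacity))
}

func main() {
	const gi = int64(1) << 30

	// Before hot plug: 16Gi of memory and 4Gi of swap on the node.
	before := limitedSwap(2*gi, 16*gi, 4*gi)
	// After hot plugging memory up to 32Gi (swap unchanged), the same
	// container is entitled to a smaller swap limit, so it must be updated.
	after := limitedSwap(2*gi, 32*gi, 4*gi)

	fmt.Printf("swap limit before: %d bytes, after: %d bytes\n", before, after)
}
```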
* Achieve seamless node capacity expansion through hot plugging resources.
* Enable the re-initialization of resource managers (CPU manager, memory manager) and kube runtime manager to accommodate alterations in the node's resource allocation.
* Recalculating and updating the OOMScoreAdj and swap memory limit for existing pods.
Do we need to add something about huge pages?
Any implications for Nodes using DPDK / SR-IOV in relation to CNIs which we need to consider?
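On the "Recalculating and updating the OOMScoreAdj" goal in the excerpt above: the kubelet's score for Burstable pods is a function of total node memory, which is exactly why a capacity change makes existing scores stale. A simplified sketch of that dependency (burstableOOMScoreAdj is an illustrative helper; the kubelet's real logic also covers Guaranteed/BestEffort pods and clamps the result):

```go
package main

import "fmt"

// burstableOOMScoreAdj mirrors the shape of the kubelet formula for Burstable
// pods: the larger the share of node memory a container requests, the lower
// (safer) its oom_score_adj. Simplified; clamping and special cases omitted.
func burstableOOMScoreAdj(memoryRequest, memoryCapacity int64) int {
	if memoryCapacity == 0 {
		return 1000
	}
	return int(1000 - (1000*memoryRequest)/memoryCapacity)
}

func main() {
	const gi = int64(1) << 30
	request := 4 * gi

	// Hot plugging memory changes the capacity term, so scores computed for
	// already-running containers need to be recomputed.
	fmt.Println("score on a 16Gi node:", burstableOOMScoreAdj(request, 16*gi)) // 750
	fmt.Println("score on a 32Gi node:", burstableOOMScoreAdj(request, 32*gi)) // 875
}
```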
Based on the conversation over Slack and here, I added a section about compatibility with Cluster Autoscaler: https://github.com/Karthik-K-N/enhancements/tree/node-resize/keps/sig-node/3953-node-resource-hot-plug#compatability-with-cluster-autoscaler.
Thank you, the added section accurately captures the problem and possible solutions. Could you add some information on when you want to address this part? When going to beta with the feature?
Revisiting here: as far as I remember from the previous discussion somewhere, no change is required in the scheduler; is that still true?
If not, or if you're not sure, please make sure to add someone from sig-scheduling to review it.
Yes, it still stands the same: no changes are required from the scheduler. As you mentioned in the previous review, I have updated the KEP design details to include the phrase.
For more reference, see our previous discussion: Slack, Enhancement.
Sure, as long as that part isn't changed, I believe it's OK from the sig-scheduling side.
Yeah, I will reach out if there are any changes in the future. Thank you for taking a look.
Co-authored-by: kishen-v <[email protected]>
Thanks for the review. We plan to provide compatibility with the autoscaler by storing the initial node resources in the node object, and we plan to deliver that during beta graduation, so we can get started now. But we are open to community feedback. Thank you.
Sounds good to me, thanks!
Gave the KEP a read; no comments or suggestions currently, but happy to see this work progressing.
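To make the plan above ("storing the initial node resources in the node object") a bit more concrete, here is a minimal sketch under assumptions of ours, not the KEP's: InitialCapacityAnnotation is a made-up annotation key, and recordInitialCapacity is a hypothetical helper the kubelet could run once at startup so the Cluster Autoscaler can keep templating new Nodes from pre-hotplug capacity.

```go
package main

import (
	"encoding/json"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// InitialCapacityAnnotation is a hypothetical annotation key; the KEP only
// discusses recording the initial node resources, it does not name an API.
const InitialCapacityAnnotation = "example.kubernetes.io/initial-capacity"

// recordInitialCapacity stores the node's boot-time capacity in an annotation
// the first time it is seen and leaves it untouched on later hot plug events,
// giving the autoscaler a stable view of the node group's template.
func recordInitialCapacity(node *v1.Node) error {
	if _, ok := node.Annotations[InitialCapacityAnnotation]; ok {
		return nil // already recorded before any hot plug happened
	}
	raw, err := json.Marshal(node.Status.Capacity)
	if err != nil {
		return err
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[InitialCapacityAnnotation] = string(raw)
	return nil
}

func main() {
	node := &v1.Node{
		Status: v1.NodeStatus{
			Capacity: v1.ResourceList{
				v1.ResourceCPU:    resource.MustParse("4"),
				v1.ResourceMemory: resource.MustParse("16Gi"),
			},
		},
	}
	if err := recordInitialCapacity(node); err == nil {
		fmt.Println(node.Annotations[InitialCapacityAnnotation])
	}
}
```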