KEP-5224: Node Resource Discovery #5319
Conversation
marquiz commented May 19, 2025
- One-line PR description: Node Resource Discovery KEP
- Issue link: Node Resource Discovery #5224
- Other comments:
@marquiz: GitHub didn't allow me to request PR reviews from the following users: Tal-or. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
```proto
// GetDynamicRuntimeConfig is a streaming interface for receiving dynamically
// changing runtime and node configuration
rpc GetDynamicRuntimeConfig(DynamicRuntimeConfigRequest) returns (stream DynamicRuntimeConfigResponse) {}
```
hmm this doesn't really seem like a runtime config..
This name came into existence with a background thought of creating a more generic "channel" for the runtime to inform kubelet about changes (without kubelet polling). But I agree, the name is bad and the background idea probably too.
Related, @ffromani suggested having a completely separate service (e.g. `DiscoveryService`, in addition to `RuntimeService` and `ImageService`) for handling this.
Thoughts?
yeah or maybe `ResourceService` or `NodeService` or something?
but big +1 on separate CRI service
Check, I will change this in the next update. I don't have strong opinions on this. One detail is that with a separate service there is the possibility to connect to a 3rd-party NodeService/ResourceService agent. That may be a good thing for some special users(?)
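For illustration, a minimal sketch (not the KEP's actual API) of how a separate CRI service could be modelled on the kubelet side, mirroring how RuntimeService and ImageService appear as Go interfaces. All names here (`NodeResourcesService`, `NodeResources`, `GetNodeResources`, `WatchNodeResources`) are hypothetical:

```go
package cri

import "context"

// NodeResources is a placeholder for the payload the runtime would return:
// resource capacity plus the attributes of the resource topology tree.
type NodeResources struct {
	Capacity   map[string]int64  // e.g. "cpu", "memory", "hugepages-2Mi"
	Attributes map[string]string // e.g. "MachineID", "BootID", "SystemUUID"
}

// NodeResourcesService is a standalone service, separate from RuntimeService
// and ImageService, so the kubelet could also dial a 3rd-party agent that
// implements only resource discovery.
type NodeResourcesService interface {
	// GetNodeResources returns a point-in-time snapshot of node resources.
	GetNodeResources(ctx context.Context) (*NodeResources, error)
	// WatchNodeResources streams updates when capacity changes, replacing
	// the generic GetDynamicRuntimeConfig streaming RPC quoted above.
	WatchNodeResources(ctx context.Context) (<-chan *NodeResources, error)
}
```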
```go
const (
	ResourceTopologyZoneCore = "Core"
```
these currently are a subset of the cAdvisor MachineInfo fields. is this all the kubelet uses?
Yup, picked the ones that the kubelet is interested in. Of course we can add all that we can imagine (e.g. the current CPU topology levels from the Linux kernel, i.e. package, die, cluster, core).
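Illustrative only: what the zone-type constants could look like if the full set of Linux CPU topology levels mentioned above were added. Apart from `ResourceTopologyZoneCore` (quoted from the KEP), these names are assumptions, not part of the proposal:

```go
package cri

const (
	ResourceTopologyZoneNode    = "Node"    // NUMA node
	ResourceTopologyZonePackage = "Package" // physical package/socket
	ResourceTopologyZoneDie     = "Die"     // die within a package
	ResourceTopologyZoneCluster = "Cluster" // cluster of cores
	ResourceTopologyZoneCore    = "Core"    // physical core
)
```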
> The kubelet reads MachineID, BootID and SystemUUID from the attributes of the
> resource topology tree. If this information is not present the kubelet uses the
> cAdvisor MachineInfo as a fallback.
forever? or will it stop after GA? What if this info is never available? should the kubelet exit?
oops I should read the section above: set the node not ready for the last question
I think this is a good question in general: whether MachineID, BootID and SystemUUID should come from the runtime or whether the kubelet should figure these out itself. I put them here to be able to ditch cAdvisor MachineInfo completely. Glad to hear opinions on this.
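A minimal sketch of the fallback described in the quoted text, assuming a hypothetical `attrs` map extracted from the resource topology tree; `MachineID`, `BootID` and `SystemUUID` are real fields on cAdvisor's MachineInfo, but the per-field fallback shown here is an assumption:

```go
package kubelet

import cadvisorapi "github.com/google/cadvisor/info/v1"

// machineIdentity prefers runtime-provided attributes and falls back to
// cAdvisor MachineInfo for any field the runtime did not supply.
func machineIdentity(attrs map[string]string, mi *cadvisorapi.MachineInfo) (machineID, bootID, systemUUID string) {
	machineID, bootID, systemUUID = attrs["MachineID"], attrs["BootID"], attrs["SystemUUID"]
	if machineID == "" {
		machineID = mi.MachineID
	}
	if bootID == "" {
		bootID = mi.BootID
	}
	if systemUUID == "" {
		systemUUID = mi.SystemUUID
	}
	return machineID, bootID, systemUUID
}
```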
> A rollout could fail e.g. because of a bug in the CRI runtime, the runtime
> returning data that the kubelet cannot consume. In this case the node will be
> set into NotReady state. Existing workloads should not be affected but new pods
What should the cluster admin do in this case? is NotReady the correct signal?
The cluster admin can rollback. Regarding NotReady, I'm open to guidance, suggestions. I think the node should not be ready as new stuff shouldn't be scheduled there. But what else? Events, conditions?
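A sketch of one option raised above: an explicit node condition the kubelet could publish in addition to flipping NodeReady, so admins can tell why the node is not ready and know to roll back. The condition type and reason strings are hypothetical:

```go
package kubelet

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// resourceDiscoveryFailedCondition builds a node condition describing why
// resource discovery failed; the type/reason names are illustrative only.
func resourceDiscoveryFailedCondition(err error) v1.NodeCondition {
	return v1.NodeCondition{
		Type:               "ResourceDiscoveryFailed", // hypothetical condition type
		Status:             v1.ConditionTrue,
		Reason:             "InvalidRuntimeResourceData",
		Message:            err.Error(),
		LastTransitionTime: metav1.Now(),
	}
}
```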
> TBD.

> - [ ] Events
we probably want a metric if the kubelet falls back to cAdvisor for a certain resource
You mean a metric for falling back to cAdvisor? The thinking in this proposal is that it's all or nothing: either all resources (native, i.e. cpu, memory, hugepages and swap) come from cAdvisor, or none do.
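A hypothetical sketch of the suggested metric, using the component-base metrics machinery the kubelet already uses. The metric name and registration point are assumptions, not existing kubelet code; given the all-or-nothing semantics described in the reply, a single counter (rather than a per-resource one) would suffice:

```go
package metrics

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// nodeResourceDiscoveryFallbackTotal counts how many times node resource
// discovery fell back to cAdvisor MachineInfo. Name is illustrative only.
var nodeResourceDiscoveryFallbackTotal = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem:      "kubelet",
		Name:           "node_resource_discovery_cadvisor_fallback_total",
		Help:           "Number of times node resource discovery fell back to cAdvisor MachineInfo.",
		StabilityLevel: metrics.ALPHA,
	},
)

func init() {
	legacyregistry.MustRegister(nodeResourceDiscoveryFallbackTotal)
}
```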
> If Node Resource Hotplug ([KEP-3953][kep-3953]) is enabled in tandem, the node
this creates a dependency on this KEP for 3953. @Karthik-K-N do you +1 this?
I can also remove this snippet. This proposal (KEP-5224) alone does not cause this. But with both features enabled at the same time, this is what will happen.
Yeah, these two KEPs complement each other if enabled together; otherwise each works as expected independently.
- fix typos
> A rollout could fail e.g. because of a bug in the CRI runtime, the runtime
> returning data that the kubelet cannot consume. In this case the node will be
> set into NotReady state. Existing workloads should not be affected but new pods
When a node becomes NotReady, leaving pods in the Running state is considered controversial behavior. We may correct this in future releases.
ref: kubernetes/kubernetes#125618
Thanks @HirazawaUi for pointing this out. I'll add this detail in the next update.
> ### Goals
>
> - Ability for kubelet to get node resources (capacity) from the CRI runtime
Do your goals include maintaining the availability of the kubelet's /metrics/resource endpoint after migrating to CRI-based node resource discovery?
Yes, this is completely independent of the stats/metrics stuff
I think there should be a correlation here.
There is an ambiguity: in the future, will the data returned by the `/stats` and `/metrics/resource` endpoints represent the hardware resources owned by the Kubernetes Node resource object, or the actual physical hardware resources of the node?
Prior to this KEP, these were nearly equivalent. However, once a user enables the feature gate proposed in this KEP, it might allocate only a subset of the node's resources (e.g., a portion of CPU cores) to the kubelet through extensible mechanisms. This could create ambiguity in the reported metrics.
/cc
> - Ability for kubelet to get node resources (capacity) from the CRI runtime
> - Retain current functionality of cpu, memory and topology managers
> - API that can support dynamic node capacity changes
Explicitly state that the CPU topology info we plan to expose is enough for, and compatible with, Slurm-like HPC workloads.