CA: DRA integration MVP #7530
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: towca. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@@ -0,0 +1,75 @@
package utils
before this line, you have to add the following comment:
/*
Copyright 2020 The Kubernetes Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
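For placement, a sketch of how the top of the file would then look (the boilerplate comment sits directly above the package clause; nothing else is implied):

```go
/*
Copyright 2020 The Kubernetes Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package utils
```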
Done
force-pushed from 3ad1daa to cd3ed99
force-pushed from 9847f70 to 33afea0
force-pushed from e3cee18 to 3629698
/wg device-management
I only looked at the PR description so far. I am not sure whether I am qualified to review the code 😰
Agreed, this can come later. https://pkg.go.dev/k8s.io/[email protected]/resourceclaim might be the right package for that. If "resourceclaim" works as a package name, then you could already use it now, and later only the import path would need to be changed once that new code is upstream.

One thing that I don't see mentioned is node readiness: directly after provisioning a new node, the DRA driver pod(s) have to be started on the node and publish their ResourceSlices before the new node is really ready to support pods using claims. I know that there are checks for device plugins to treat such a node as "not ready" while plugins are starting. Is there something similar for DRA drivers? This might not be needed for MVP.
force-pushed from 29b9543 to 2daeb9b
Yeah, unfortunately it's quite a lot. IMO we should split reviewing responsibilities here:
/assign @BigDarkClown
I was wondering about that part; it's hard to test without an actual driver (the test driver from the k/k e2e tests doesn't run into this). So a new Node can be Ready but not yet have its ResourceSlices exposed, is that right? If so, then we indeed need to reimplement the device plugin logic. How can CA know that a Node is supposed to have any slices exposed, though? For device plugins we depend on a specific label on the Node (if the label is on the Node but there's no allocatable resource -> fake the node as not ready).

@jackfrancis could you verify how this behaves in practice during your manual testing? Are we seeing double scale-ups?
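A minimal sketch of transplanting the label-based device plugin check to DRA, under assumptions: the label name is hypothetical, the resource.k8s.io/v1beta1 API and client-go's generated clientset are assumed, and this is not CA's actual implementation.

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// draDriverExpectedLabel is a hypothetical label that node groups running a DRA
// driver would carry, mirroring the label-based check used for device plugins.
const draDriverExpectedLabel = "example.com/expects-dra-driver"

// shouldFakeNodeNotReady returns true when the node is labeled as expecting a DRA
// driver but has not published any ResourceSlices yet, i.e. the node is Ready but
// its dynamic resources aren't usable in scheduling simulations yet.
func shouldFakeNodeNotReady(ctx context.Context, client kubernetes.Interface, node *v1.Node) (bool, error) {
	if _, expectsDriver := node.Labels[draDriverExpectedLabel]; !expectsDriver {
		return false, nil
	}
	slices, err := client.ResourceV1beta1().ResourceSlices().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, slice := range slices.Items {
		if slice.Spec.NodeName == node.Name {
			// At least one slice is published for this node; treat it as ready.
			return false, nil
		}
	}
	return true, nil
}
```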
Great work @towca! /lgtm
Thanks for the reviews again, merging! /unhold
Suggestion for post-MVP work: document how to autoscale a cluster that uses dynamically-allocated resources.
Suggestion for post-MVP work: making --gpu-total work with Dynamic Resource Allocation (DRA). In my understanding, the --gpu-total option relies on a counting methodology that depends on Device Plugins.
I don't know what that option does. Instead of special-case options for specific categories of hardware, perhaps the concept can be generalized?
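One rough direction for such a generalization, as a sketch under assumptions (this is not an existing CA flag or behaviour, and the resource.k8s.io/v1beta1 API is assumed): count devices per DRA driver from the published ResourceSlices instead of counting a hardcoded GPU resource.

```go
package sketch

import (
	resourceapi "k8s.io/api/resource/v1beta1"
)

// countDevicesByDriver sums the devices advertised across ResourceSlices, keyed by
// the driver that exposes them. A per-driver total limit could then be checked
// against these counts rather than against a GPU-specific resource name.
func countDevicesByDriver(slices []*resourceapi.ResourceSlice) map[string]int {
	counts := map[string]int{}
	for _, slice := range slices {
		counts[slice.Spec.Driver] += len(slice.Spec.Devices)
	}
	return counts
}
```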
@pohly
Categorizing devices doesn't make sense to me and I also find the concept of "counting GPUs" questionable. GPUs can be large and small. A single parameter which applies equally to everything that is a GPU seems like an oversimplification. But I don't have a stake in this - just my two cents.
I agree with @pohly on the larger point. I also did some cursory digging and this feature seems to be very incompletely implemented. Documentation suggests it only works with GKE. My instinct would be to work on deprecating this feature. What do folks think? cc @MaciekPytel @towca
If this function is deprecated, I will accept it.
My idea was to add another field to ResourceSlice, like spec.devices.type.
This option could specify a GPU type as
It may be needless to say, but I think the document's information is not the most recent and
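For concreteness only, the suggestion above might look roughly like this as a Go type. This is purely hypothetical, not a proposed or accepted API change:

```go
package sketch

// hypotheticalDevice sketches a DRA device entry with the extra categorization
// field suggested above. Agreeing on the allowed values across vendors is the
// open question raised in the reply that follows.
type hypotheticalDevice struct {
	Name string `json:"name"`
	// Type would coarsely categorize the device, e.g. "gpu".
	Type string `json:"type,omitempty"`
}
```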
That is what doesn't make sense to me. Someone would have to define types in such a way that different vendors and users can agree on the definition and then use it consistently. "I know a GPU when I see it" is not good enough. I suspect the usage of the
It sounds difficult...
I'm on the same page.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler.
This PR implements an MVP of DRA integration in Cluster Autoscaler. Not all CA features work with DRA yet (see list below), but most logic is implemented and DRA autoscaling can be tested in a real cluster.
Changes summary:
- dynamicresources.Provider, which retrieves and snapshots all DRA objects.
- dynamicresources.Snapshot, which allows modifying the DRA objects snapshot obtained from Provider, and exposes the DRA objects to the scheduler framework (sketched below).
- Updates to the dynamicresources.Snapshot (inside ClusterSnapshotStore) whenever the ClusterSnapshot is modified (in PredicateSnapshot methods).
- StaticAutoscaler integration-like unit tests covering DRA scenarios.
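To picture the two new components, an illustrative sketch only (not the PR's actual types; the resource.k8s.io/v1beta1 API is assumed): the Provider fetches the DRA objects once per loop, and the Snapshot holds them in memory so simulations can mutate and clone them.

```go
package sketch

import (
	resourceapi "k8s.io/api/resource/v1beta1"
)

// draSnapshot is an in-memory, modifiable view of the DRA objects used during
// scheduling simulations.
type draSnapshot struct {
	claims map[string]*resourceapi.ResourceClaim   // keyed by namespace/name
	slices map[string][]*resourceapi.ResourceSlice // keyed by node name
}

// clone copies the snapshot so a simulation can modify claim allocations without
// affecting the original.
func (s *draSnapshot) clone() *draSnapshot {
	out := &draSnapshot{
		claims: make(map[string]*resourceapi.ResourceClaim, len(s.claims)),
		slices: make(map[string][]*resourceapi.ResourceSlice, len(s.slices)),
	}
	for key, claim := range s.claims {
		out.claims[key] = claim.DeepCopy()
	}
	for nodeName, nodeSlices := range s.slices {
		copied := make([]*resourceapi.ResourceSlice, 0, len(nodeSlices))
		for _, slice := range nodeSlices {
			copied = append(copied, slice.DeepCopy())
		}
		out.slices[nodeName] = copied
	}
	return out
}
```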
The following features don't work with DRA yet, and will be tackled post-MVP:

- Preemption simulation: when CA sees a pod waiting for preemption (one with nominatedNodeName set), it adds the pod to the nominatedNodeName Node in the snapshot without checking predicates, or even removing the preempted pod (so the Node can be effectively "overscheduled"). If such a Pod uses DRA, we'll have to run scheduler predicates to actually obtain the necessary ResourceClaim allocations. If we just force-add such a Pod to the snapshot without modifying the claims, CA doesn't see the Node's ResourceSlices as used and can just schedule another Pod to use them in the simulations.
- DaemonSet pods using DRA aren't properly handled whenever TemplateNodeInfo() has to be called (since scheduler predicates aren't currently run there). Forcing "missing" DS pods onto template Nodes also won't work if the DS pods use DRA (we'll have to duplicate their resource claims and start running scheduler predicates there as well).
- Only the BasicSnapshotStore (old BasicClusterSnapshot) implementation is integrated with dynamicresources.Snapshot. DeltaSnapshotStore (old DeltaClusterSnapshot) isn't yet. We'll need to add some delta-like capability to dynamicresources.Snapshot for that, in addition to just Clone().

Additionally, the following points will have to be tackled post-MVP:

- The unschedulable pods are currently obtained in RunOnce, then processed by PodListProcessors and passed to ScaleUp. The Pods' DRA objects, on the other hand, live in the dynamicresources.Snapshot inside ClusterSnapshot. This leaves us with a risk of the two data sources diverging quite easily (e.g. a PodListProcessor injecting a "fake" unschedulable Pod to the list, but not injecting the Pod's ResourceClaims to the ClusterSnapshot). To make this better, in a follow-up PR the unschedulable pods will be fully moved inside ClusterSnapshot, and PodListProcessors will interact with them via ClusterSnapshot methods instead of getting them by argument and returning.
- PredicateSnapshot methods like AddNodeInfo() or SchedulePod() can fail because of DRA-related issues, but don't always clean up the partial DRA snapshot modifications that happened prior to the error. This shouldn't be an issue for MVP because these errors would mean aborting the whole loop anyway (see the "Error policy" point above), and the snapshot would be recreated from scratch in the next loop. It will be an issue if we want to proceed with the loop when seeing these errors though, so it should probably be tackled together with the Error policy point above (a rough rollback sketch follows below).

I'll cut issues for the post-MVP work listed above after the PR is merged.
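A sketch of one way the last point could be handled, assuming fork/revert-style snapshot operations. The interface below is hypothetical, not CA's actual API; it only illustrates containing partial DRA modifications when a scheduling operation fails part-way through.

```go
package sketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// snapshotter is a hypothetical stand-in for a cluster snapshot that supports
// nested modification scopes.
type snapshotter interface {
	Fork()
	Revert()
	Commit() error
	SchedulePod(pod *v1.Pod, nodeName string) error
}

// schedulePodAtomically reverts any partial changes, including DRA claim
// allocations, if scheduling the pod in the snapshot fails mid-way.
func schedulePodAtomically(s snapshotter, pod *v1.Pod, nodeName string) error {
	s.Fork()
	if err := s.SchedulePod(pod, nodeName); err != nil {
		s.Revert()
		return fmt.Errorf("scheduling %s/%s on %s failed: %w", pod.Namespace, pod.Name, nodeName, err)
	}
	return s.Commit()
}
```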
Which issue(s) this PR fixes:
Fixes kubernetes/kubernetes#118612
Special notes for your reviewer:
The PR is split into meaningful commits intended to be reviewed sequentially.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: