
CA: DRA integration MVP #7530

Merged — 10 commits, Dec 20, 2024
Conversation

towca (Collaborator) commented Nov 25, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler.

This PR implements an MVP of DRA integration in Cluster Autoscaler. Not all CA features work with DRA yet (see list below), but most logic is implemented and DRA autoscaling can be tested in a real cluster.

Changes summary:

  • Implement some utils for interacting with ResourceClaims. Ideally these would live upstream, but IMO we can revisit this later.
  • Introduce dynamicresources.Provider which retrieves and snapshots all DRA objects.
  • Introduce dynamicresources.Snapshot which allows modifying the snapshot of DRA objects obtained from Provider, and exposes the DRA objects to the scheduler framework (a rough sketch of how these pieces fit together follows this list).
  • Implement a very rudimentary utilization calculation for device pools, so that scale-down has something to act on.
  • Start sanitizing the DRA objects when duplicating Nodes and their DS-like pods in the snapshot.
  • Modify all relevant DRA objects (tracked in dynamicresources.Snapshot inside ClusterSnapshotStore) whenever ClusterSnapshot is modified (in PredicateSnapshot methods).
  • Add StaticAutoscaler integration-like unit tests covering DRA scenarios.
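
For orientation, here's a rough sketch of how the new pieces are meant to fit together. All names, fields, and signatures below are simplified assumptions for illustration, not the actual code in this PR — the real definitions live under cluster-autoscaler/simulator/dynamicresources and are best read commit by commit:

```go
// Illustrative sketch only: names and fields here are assumptions, not the
// exact code in this PR.
package dynamicresources

import (
	resourceapi "k8s.io/api/resource/v1beta1"
)

// Provider lists all DRA objects once per autoscaling loop and freezes them
// into a Snapshot.
type Provider interface {
	Snapshot() (*Snapshot, error)
}

// Snapshot is the in-memory view of the DRA state that the scheduler framework
// sees during simulations. It lives inside ClusterSnapshotStore and is updated
// by PredicateSnapshot whenever pods are scheduled or unscheduled.
type Snapshot struct {
	claims        map[string]*resourceapi.ResourceClaim   // keyed by namespace/name
	nodeSlices    map[string][]*resourceapi.ResourceSlice // node-local slices, keyed by node name
	deviceClasses map[string]*resourceapi.DeviceClass     // keyed by class name
}

// Clone deep-copies the snapshot so a simulation can fork and revert the DRA
// state together with the rest of the cluster snapshot.
func (s *Snapshot) Clone() *Snapshot {
	out := &Snapshot{
		claims:        make(map[string]*resourceapi.ResourceClaim, len(s.claims)),
		nodeSlices:    make(map[string][]*resourceapi.ResourceSlice, len(s.nodeSlices)),
		deviceClasses: make(map[string]*resourceapi.DeviceClass, len(s.deviceClasses)),
	}
	for key, claim := range s.claims {
		out.claims[key] = claim.DeepCopy()
	}
	for node, slices := range s.nodeSlices {
		for _, slice := range slices {
			out.nodeSlices[node] = append(out.nodeSlices[node], slice.DeepCopy())
		}
	}
	for name, class := range s.deviceClasses {
		out.deviceClasses[name] = class.DeepCopy()
	}
	return out
}
```

Keeping Clone() cheap matters because simulations fork the snapshot frequently — that's also why DeltaSnapshotStore integration (listed below) will need a delta-like capability in addition to a full copy.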

The following features don't work with DRA yet, and will be tackled post-MVP:

  • Priority-based preemption for pods using DRA. If CA sees an unschedulable pod waiting for scheduler preemption (with nominatedNodeName set), it adds the pod to the nominated Node in the snapshot without checking predicates, and without removing the preempted pod (so the Node can effectively be "overscheduled"). If such a Pod uses DRA, we'll have to run scheduler predicates to actually obtain the necessary ResourceClaim allocations. If we just force-add such a Pod to the snapshot without modifying the claims, CA doesn't see the Node's ResourceSlices as used and can simply schedule another Pod to use them in the simulations.
  • DaemonSet/static pods using DRA (in some cases). Similarly to the first point, DS/static pods using DRA won't work correctly during scale-from-0-nodes scenarios where TemplateNodeInfo() has to be called (since scheduler predicates aren't currently run there). Forcing "missing" DS pods onto template Nodes also won't work if the DS pods use DRA (we'll have to duplicate their resource claims and start running scheduler predicates there as well).
  • DeltaSnapshotStore. Only the BasicSnapshotStore (old BasicClusterSnapshot) implementation is integrated with dynamicresources.Snapshot. DeltaSnapshotStore (old DeltaClusterSnapshot) isn't yet. We'll need to add some delta-like capability to dynamicresources.Snapshot for that in addition to just Clone().
  • DRA admin mode. DRA has an "admin mode" feature, guarded by a separate feature gate. A request for admin access to a device can be expressed in a ResourceClaim, meaning that the claim is only used to monitor or otherwise manage the device, not to actually use it. Such claims can be allocated devices that are already allocated to another ResourceClaim. This has implications for some of the DRA logic in CA (e.g. such claims shouldn't be counted when computing utilization), but this isn't implemented yet. ResourceClaims with admin mode are treated the same as regular ResourceClaims in the MVP implementation.

Additionally, the following points will have to be tackled post-MVP:

  • Unschedulable pods state. As pointed out by @MaciekPytel during the initial review of WIP: Implement DRA support in Cluster Autoscaler #7350, the state for unschedulable pods is kept in two separate places after this PR. The pod objects themselves are just a list obtained from listers in RunOnce, then processed by PodListProcessors and passed to ScaleUp. The Pods' DRA objects, on the other hand, live in the dynamicresources.Snapshot inside ClusterSnapshot. This leaves us with a risk of the two data sources diverging quite easily (e.g. a PodListProcessor injecting a "fake" unschedulable Pod into the list, but not injecting the Pod's ResourceClaims into the ClusterSnapshot). To make this better, in a follow-up PR the unschedulable pods will be fully moved inside ClusterSnapshot, and PodListProcessors will interact with them via ClusterSnapshot methods instead of receiving them as an argument and returning them.
  • Calculating Node utilization for DRA resources.
    • Cluster Autoscaler scale-down logic calculates a utilization value for each Node in the cluster. Only Nodes with utilization below a configured threshold are considered candidates for scale-down. Utilization is computed separately for every resource, by dividing the sum of Pods' requests by the Node allocatable value (it's not looking at "real" usage, just requests). If a Node has a GPU, only the GPU utilization is taken into account - CPU and memory are ignored. If a Node doesn't have a GPU, its utilization is the max of CPU and memory utilization values. It's not immediately clear how to extend this model for DRA resources.
    • This PR extends the utilization model in the following way. All Node-local Devices exposed in ResourceSlices for a given Node are grouped by their Driver and Pool. If ResourceSlices don't present a complete view of a Pool, utilization is not calculated and the Node is not scaled down until that changes. Allocated Node-local devices from the scheduled Pods' ResourceClaims are also grouped by their Driver and Pool. Utilization is calculated separately for every <Driver, Pool> pair by dividing the number of allocated devices by the number of total devices for a given Driver and Pool. Utilization of the Node is the max of these <Driver, Pool> utilization values. Similarly to how GPUs are treated, if DRA resources are exposed for a Node only DRA utilization is taken into account - CPU and memory are ignored.
    • The logic above should work pretty well for Nodes with one local Pool with identical, expensive (compared to CPU/memory) Devices - basically mimicking the current GPU approach. For other scenarios, it will behave predictably but it doesn't seem nearly flexible enough to be usable in practice (I might be wrong here though). What if a given DRA Device is not more expensive than CPU/memory and shouldn't be prioritized? What if it's some fake, "free" Device that shouldn't actually be taken into account when calculating utilization? Etc.
    • IMO solving this properly will require some discussions and design, so this PR only implements the "provisional" utilization calculation logic described above, so that scale-down can be tested (a simplified sketch of this calculation follows this list).
  • Error policy. Cluster Autoscaler tends to error out and break the whole loop in case of any unexpected errors, and the implementation in this PR mostly follows this approach for simplicity. This is not a good direction in general; we've had a number of issues in the GKE CA where a bug related to a small subset of pods/nodes would break CA completely because of it. IMO we should holistically rethink whether CA can proceed with the loop when it encounters DRA-related errors (and ideally non-DRA-related errors as well, but that's a separate issue).
  • Transaction-like clean-up on DRA-related errors in PredicateSnapshot. PredicateSnapshot methods like AddNodeInfo() or SchedulePod() can fail because of DRA-related issues, but don't always clean up the partial DRA snapshot modifications that happened prior to the error. This shouldn't be an issue for MVP because these errors would mean aborting the whole loop anyway (see the "Error policy" point above), and the snapshot would be recreated from scratch in the next loop. It will be an issue if we want to proceed with the loop when seeing these errors though, so it should probably be tackled with the Error policy point above.
  • Upstreaming ResourceClaim utils. This PR introduces utility functions for interacting with ResourceClaims. Some of them are already implemented upstream, and some of them could be used by other components if they were implemented upstream. We should upstream as many of the utility functions as possible and migrate all the relevant components to use them. This will help ensure consistent behavior between different components interacting with the claims.
  • Integration tests for disabled DRA flag. All existing unit tests are run with DRA enabled whenever possible, to cover more code paths and ensure that the behavior stays the same when DRA is enabled but there are no DRA objects. These tests should also cover the "DRA is disabled and there are no DRA objects" case, as the flag just enables more codepaths. The "DRA is enabled and there are DRA objects" case has a lot of integration tests, but we're missing coverage for the "DRA is disabled and there are DRA objects" case. We should add some integration tests for that to ensure that CA still behaves in a sane way.
  • Node readiness after scale-up. Nodes with custom resources exposed by device plugins (e.g. GPUs) report the Ready condition before they actually expose the resources. Cluster Autoscaler has to hack them to appear not-Ready until they do expose the resources, otherwise the unschedulable pods don't get packed onto the Nodes in filter_out_schedulable and CA does another, unnecessary scale-up. It seems that we might have a similar problem with DRA Nodes being Ready before all their ResourceSlices are published. We didn't actually observe it happening when testing this PR manually, but we need to dig further to confirm that it's not an issue, or fix it if it is.
  • E2E tests. We should add some e2e tests covering a subset of the DRA autoscaling scenarios from integration tests.
  • Documentation. We should document the feature on the Node autoscaling page in the k8s docs after kubernetes/website#45802 ("Concepts/ClusterAdministration: Expand Node Autoscaling documentation") finally gets merged.
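
To make the provisional utilization logic above concrete, here's a minimal sketch of the per-<Driver, Pool> calculation. The helper name and exact field accesses are illustrative assumptions (the real logic lives in cluster-autoscaler/simulator/dynamicresources/utils/utilization.go), and the "incomplete Pool view" check described above is omitted for brevity:

```go
// Minimal sketch of the provisional <Driver, Pool> utilization described above.
// Names are illustrative, not the actual CA code.
package utils

import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1beta1"
)

type driverPool struct {
	driver, pool string
}

// highestDynamicResourceUtilization groups a Node's local devices by
// <Driver, Pool>, divides allocated devices by total devices per pair, and
// returns the maximum ratio.
func highestDynamicResourceUtilization(
	nodeSlices []*resourceapi.ResourceSlice,
	scheduledClaims []*resourceapi.ResourceClaim,
) (float64, error) {
	// Count all devices exposed for the Node, grouped by <Driver, Pool>.
	totalDevices := map[driverPool]int{}
	for _, slice := range nodeSlices {
		key := driverPool{driver: slice.Spec.Driver, pool: slice.Spec.Pool.Name}
		totalDevices[key] += len(slice.Spec.Devices)
	}

	// Count devices allocated by the scheduled Pods' claims, same grouping.
	allocatedDevices := map[driverPool]int{}
	for _, claim := range scheduledClaims {
		if claim.Status.Allocation == nil {
			continue
		}
		for _, result := range claim.Status.Allocation.Devices.Results {
			allocatedDevices[driverPool{driver: result.Driver, pool: result.Pool}]++
		}
	}

	// Node utilization is the max over all <Driver, Pool> pairs.
	maxUtil := 0.0
	for key, total := range totalDevices {
		if total == 0 {
			return 0, fmt.Errorf("pool %q of driver %q exposes no devices", key.pool, key.driver)
		}
		if util := float64(allocatedDevices[key]) / float64(total); util > maxUtil {
			maxUtil = util
		}
	}
	return maxUtil, nil
}
```

Devices allocated from pools that don't belong to the Node simply don't match any <Driver, Pool> key built from the Node's ResourceSlices and are ignored, matching the Node-local scoping described above.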

I'll cut issues for the post-MVP work listed above after the PR is merged.

Which issue(s) this PR fixes:

Fixes kubernetes/kubernetes#118612

Special notes for your reviewer:

The PR is split into meaningful commits intended to be reviewed sequentially.

Does this PR introduce a user-facing change?

Experimental support for DRA autoscaling is implemented, disabled by default. To enable it, set the `--enable-dynamic-resource-allocation` flag in Cluster Autoscaler, and the `DynamicResourceAllocation` feature gate in the cluster. Additionally, RBAC configuration must allow Cluster Autoscaler to list the following new objects: `resource.k8s.io/ResourceClaim`, `resource.k8s.io/ResourceSlice`, `resource.k8s.io/DeviceClass`. The support is experimental and not yet intended for production use. In particular, details of how Cluster Autoscaler reacts to DRA resources might change in future releases. Most autoscaling scenarios should work, but with potentially reduced performance. A list of missing features can be found in https://github.com/kubernetes/autoscaler/pull/7530.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

https://github.com/kubernetes/enhancements/blob/9de7f62e16fc5c1ea3bd40689487c9edc7fa5057/keps/sig-node/4381-dra-structured-parameters/README.md

@k8s-ci-robot added the kind/feature label on Nov 25, 2024
@k8s-ci-robot added the cncf-cla: yes label on Nov 25, 2024
@k8s-ci-robot (Contributor) commented
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the area/cluster-autoscaler, approved, and size/XXL labels on Nov 25, 2024
@Shubham82
Copy link
Contributor

Hi @towca, the Tests / test-and-verify check failed, and it is required to merge this PR.
The error is related to the boilerplate header. Here is the error:

The boilerplate header is wrong for /cluster-autoscaler/simulator/dynamicresources/utils/utilization.go

@@ -0,0 +1,75 @@
package utils
Contributor (inline review comment on the diff above):

Before this line, you have to add the following comment:

/*
Copyright 2020 The Kubernetes Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

Collaborator (Author) replied:

Done

@Shubham82 (Contributor) commented
The first commit in the PR is just a squash of #7479, #7497, and #7529, and it shouldn't be a part of this review. The PR will be rebased on top of master after the others are merged.

As mentioned in the PR description, I am putting this PR on hold to avoid an accidental merge.

/hold

@k8s-ci-robot added the do-not-merge/hold and needs-rebase labels on Nov 26, 2024
@towca force-pushed the jtuznik/dra-actual branch from 3ad1daa to cd3ed99 on November 28, 2024 at 11:20
@k8s-ci-robot removed the needs-rebase label on Nov 28, 2024
@towca force-pushed the jtuznik/dra-actual branch 2 times, most recently from 9847f70 to 33afea0, on November 29, 2024 at 11:43
@towca force-pushed the jtuznik/dra-actual branch 7 times, most recently from e3cee18 to 3629698, on December 10, 2024 at 18:16
pohly (Contributor) commented Dec 11, 2024

/wg device-management

@k8s-ci-robot added the wg/device-management label on Dec 11, 2024
pohly (Contributor) commented Dec 11, 2024

I only looked at the PR description so far. I am not sure whether I am qualified to review the code 😰

Implement some utils for interacting with ResourceClaims. These would probably ideally be upstream, but IMO we can revisit this later.

Agreed, this can come later. https://pkg.go.dev/k8s.io/dynamic-resource-allocation/resourceclaim might be the right package for that. If "resourceclaim" works as a package name, then you could already use it now, and later only the import path would need to change once that new code is upstream.

One thing that I don't see mentioned is node readiness: directly after provisioning a new node, the DRA driver pod(s) have to be started on the node and publish their ResourceSlices before the new node is really ready to support pods using claims. I know that there are checks for device plugins to treat such a node as "not ready" while plugins are starting. Is there something similar for DRA drivers? This might not be needed for MVP.

@towca force-pushed the jtuznik/dra-actual branch from 29b9543 to 2daeb9b on December 11, 2024 at 17:50
towca (Collaborator, Author) commented Dec 11, 2024

I only looked at the PR description so far. I am not sure whether I am qualified to review the code 😰

Yeah, unfortunately it's quite a lot. IMO we should split the reviewing responsibilities here:

  • @BigDarkClown you're probably the most familiar with core CA code, could you focus your review on how the DRA logic fits into the existing CA logic? In particular:
    • Could you verify that the behavior doesn't change if the DRA flag is disabled?
    • Could you review the error handling?
    • Could you check if I missed any places that need to take DRA into account (beside the parts with TODO(DRA))?
    • Could you review the code structure etc.?
  • @pohly you're definitely the most familiar with the DRA logic/assumptions, could you focus your review on the DRA parts and if the assumptions match? In particular:
    • ResourceClaim utils (1st commit) - is the logic correct there?
    • "Sanitizing" ResourceClaims (4th commit). During the simulations, CA duplicates Nodes and their DaemonSet pods. Duplicated Nodes/Pods go through "sanitization" so that they aren't actually identical to the originals. Doing this for DS pods using DRA has some corner cases (described in the comments), WDYT about this? Do we anticipate DS pods using DRA?
    • Calculating utilization (5th commit + the PR description) - does my initial "provisional" idea make sense? Is it implemented correctly?
    • Modifying the state of DRA objects during scheduling simulations (6th commit). This is tied to the ResourceClaim utils, as a lot of them are used here. Is the logic correct? One question that comes to mind - if we remove the last pod reserving a shared claim, should it be deallocated?
    • Integration tests (7th commit) - do the scenarios listed there make sense and behave as you would expect them to? Are some major scenarios missing?
  • @jackfrancis AFAIU you're familiar with both the DRA and CA parts, could you focus your review somewhere "between" the two parts above?

/assign @BigDarkClown
/assign @pohly
/assign @jackfrancis

One thing that I don't see mentioned is node readiness: directly after provisioning a new node, the DRA driver pod(s) have to be started on the node and publish their ResourceSlices before the new node is really ready to support pods using claims. I know that there are checks for device plugins to treat such a node as "not ready" while plugins are starting. Is there something similar for DRA drivers? This might not be needed for MVP.

I was wondering about that part; it's hard to test without an actual driver (the test driver from k/k e2e tests doesn't run into this). So a new Node can be Ready but not yet have its ResourceSlices exposed, is that right? If so, then we indeed need to reimplement the device plugin logic. How can CA know that a Node is supposed to have any slices exposed, though? For device plugins we depend on a specific label on the Node (if the label is on the Node but there's no allocatable resource -> fake the node as not ready).

@jackfrancis could you verify how this behaves in practice during your manual testing? Are we seeing double scale-ups?

@k8s-ci-robot added the lgtm label on Dec 20, 2024
jackfrancis (Contributor) commented
Great work @towca!

/lgtm

towca (Collaborator, Author) commented Dec 20, 2024

Thanks for the reviews again, merging!

/unhold

@k8s-ci-robot removed the do-not-merge/hold label on Dec 20, 2024
@k8s-ci-robot merged commit 50c6590 into kubernetes:master on Dec 20, 2024 (5 of 6 checks passed)
sftim (Contributor) commented Dec 20, 2024

Suggestion for post MVP work: document how to autoscale a cluster that uses dynamically-allocated resources.

ttsuuubasa commented
Suggestion for post MVP work: making --gpu-total work with Dynamic Resource Allocation (DRA).

In my understanding, the --gpu-total option relies on a counting methodology that depends on Device Plugins.
Therefore, this option doesn't work with --enable-dynamic-resource-allocation.
I think Cluster Autoscaler (CA) needs a standard way, via DRA, to identify whether a device in a ResourceSlice is a GPU.
Classifying a device's type in a ResourceSlice is currently left to the vendor's free-form "attributes" fields.
Consequently, CA has no reliable way to count GPUs.
Are you willing to support this option when DRA is enabled?

pohly (Contributor) commented Dec 26, 2024

I don't know what --gpu-total does, but a bespoke option specifically for "GPUs" (whatever they are) sounds wrong to me.

Instead of special-case options for specific categories of hardware, perhaps the concept can be generalized?

ttsuuubasa commented
@pohly
--gpu-total restricts the minimum/maximum number of GPUs in the cluster, and CA will not scale the cluster beyond these numbers.
Although your idea is certainly implementable with DRA as it is today, I'm doubtful there are more use cases for limiting the overall number of devices than for limiting GPUs specifically.
Taking into account the "Calculating Node utilization for DRA resources" point noted above, perhaps devices in a ResourceSlice should be categorized?

pohly (Contributor) commented Jan 15, 2025

Categorizing devices doesn't make sense to me and I also find the concept of "counting GPUs" questionable. GPUs can be large and small. A single parameter which applies equally to everything that is a GPU seems like an oversimplification.

But I don't have a stake in this - just my two cents.

jackfrancis (Contributor) commented
I agree with @pohly on the larger point. I also did some cursory digging and this feature seems to be very incompletely implemented. Documentation suggests it only works with GKE.

My instinct would be to work on deprecating this feature.

What do folks think? cc @MaciekPytel @towca

ttsuuubasa commented
If this function is deprecated, I will accept it.

Categorizing devices doesn't make sense to me

My idea was to add another field to ResourceSlice like spec.devices.type.

GPUs can be large and small. A single parameter which applies equally to everything that is a GPU seems like an oversimplification.

This option can specify a GPU type as --gpu-total=<gpu-type>:<min>:<max>.
<gpu-type> must match the value of the label on the node group that manages the GPU-enabled nodes of that type, and users can set any value.
I thought this specification could prevent the oversimplification you mentioned.

this feature seems to be very incompletely implemented. Documentation suggests it only works with GKE.

It may be needless to say, but I think the documentation is not up to date:
this option can also be used with Cluster API and has been adopted in Red Hat OpenShift.

pohly (Contributor) commented Jan 16, 2025

My idea was to add another field to ResourceSlice like spec.devices.type.

That is what doesn't make sense to me. Someone would have to define types in such a way that different vendors and users can agree on the definition and then use it consistently. "I know a GPU when I see it" is not good enough.

I suspect the usage of the --gpu-total option is to ensure a certain minimum amount of compute power (min value) and to ensure that total costs are not exceeded (max value). In both cases, counting GPUs is only a coarse approximation.

ttsuuubasa commented
Someone would have to define types in such a way that different vendors and users can agree on the definition and then use it consistently.

It sounds difficult...
In that case, what do you think about how to solve the "Calculating Node utilization for DRA resources" issue described in the PR description?
It needs to differentiate between DRA devices: whether they are more expensive than CPU/memory, and whether they should be prioritized.
I thought that categorizing the devices as I mentioned would help solve this issue.

I suspect the usage of the --gpu-total option is to ensure a certain minimum amount of compute power (min value) and to ensure that total costs are not exceeded (max value). In both cases, counting GPUs is only a coarse approximation.

I'm on the same page.
