KEP-4815+5234: DRA Update 4815 and split out 5234 for mixins #5238
Conversation
mortent commented Apr 11, 2025
- One-line PR description: Update PartitionableDevices KEP and create separate ResourceSliceMixins KEP
- Issue link: DRA: ResourceSlice Mixins #5234
- Issue link: DRA: Partitionable Devices #4815
- Other comments: This updates 4815 to reflect the functionality that was actually implemented for 1.33. The mixins feature was originally part of 4815, but got cut from the scope for 1.33. So this moves that functionality into a separate KEP, 5234.
/wg device-management
@bg-furiosa: changing LGTM is restricted to collaborators.
/lgtm
New changes are detected. LGTM label has been removed.
be reduced and a larger number of devices can be published within a single
ResourceSlice.
- Enable defining devices with more attributes, capacities, and consumed counters.
- Enable defining counter sets with more counters.
Two things we should consider (not sure if these are goals or implementation choices): 1) enabling mixins to be per-pool, not per-ResourceSlice; 2) enabling counters to be per-pool, not per-ResourceSlice.
If necessary, these could be considered in beta. But I do think we're going to need them in time. The second may belong in the Partitionable Devices KEP, not here.
Yeah, we have an issue for that in kubernetes/kubernetes#130785. We definitely need to make a decision on this in this cycle, as I think changing this must happen over two releases. I was hoping to handle this separately from this KEP, but open to including it here.
This KEP adds a few new limits on the size of slices/maps in the ResourceSlice, in addition to the ones that were added as part of the Partitionable Devices and other features. But as I've tried a few scenarios, I've realized the result is not great, as it makes it possible to add more attributes, capacities, and counters through mixins without actually reducing the storage size of the ResourceSlice.

An example is that we currently limit the total number of counters across the counter sets in a ResourceSlice to 32. As a result, it is impossible to create a counter set with more than 32 counters. But with mixins, I can create a counter set mixin that is only referenced from a single counter set to create larger counter sets. This doesn't reduce the number of counters defined in a ResourceSlice, it just forces users to "abuse" mixins to bypass the limit.

We should set the limits based on the total number of attributes, capacities, and counters across the ResourceSlice, rather than based on whether they are defined in a Device or a Mixin. I suggest we set the limits to something like:
So there are no special limits for mixins; they count against the same limits as the properties defined in devices. With these limits, the worst-case size for the ResourceSlice increases from 1,107,864 bytes to 1,288,825 bytes as a result of adding mixins. I think changing the limits for the counters should be pretty straightforward since those only affect fields that are still in alpha. So we can add the new ResourceSlice-wide limits and remove the more granular ones.
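To illustrate the direction (not the final numbers), a slice-wide check could look roughly like the sketch below. The types are simplified stand-ins for the real resource.k8s.io API and the limit constants are placeholders, since the proposed values are still under discussion; the point is only that everything counts against one ResourceSlice-wide budget regardless of whether it is defined on a device or in a mixin.

```go
package validation

import "fmt"

// Simplified stand-ins for the real resource.k8s.io types (illustrative only).
type ResourceSlice struct {
	Devices      []Device
	DeviceMixins []DeviceMixin
	CounterSets  []CounterSet
}

type Device struct {
	Attributes map[string]string
	Capacity   map[string]string
}

type DeviceMixin struct {
	Attributes map[string]string
	Capacity   map[string]string
}

type CounterSet struct {
	Counters map[string]string
}

// Hypothetical slice-wide budgets; the actual numbers are part of the open discussion above.
const (
	maxAttributesAndCapacitiesPerSlice = 4096
	maxCountersPerSlice                = 1024
)

// validateSliceLimits counts properties across devices, mixins, and counter
// sets and checks the totals against ResourceSlice-wide limits, instead of
// applying separate limits depending on where the properties are defined.
func validateSliceLimits(slice ResourceSlice) error {
	attrsAndCaps := 0
	for _, d := range slice.Devices {
		attrsAndCaps += len(d.Attributes) + len(d.Capacity)
	}
	for _, m := range slice.DeviceMixins {
		attrsAndCaps += len(m.Attributes) + len(m.Capacity)
	}
	if attrsAndCaps > maxAttributesAndCapacitiesPerSlice {
		return fmt.Errorf("too many attributes and capacities in ResourceSlice: %d > %d",
			attrsAndCaps, maxAttributesAndCapacitiesPerSlice)
	}

	counters := 0
	for _, cs := range slice.CounterSets {
		counters += len(cs.Counters)
	}
	if counters > maxCountersPerSlice {
		return fmt.Errorf("too many counters in ResourceSlice: %d > %d",
			counters, maxCountersPerSlice)
	}
	return nil
}
```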
### Implementation

The DRA scheduler plugin will flatten the counter sets and devices before
going through the allocation process. This will happen as part of conversion
Putting my SIG Autoscaling hat on, is this a sufficient description of the plan to provide a standard interface for rendering the "flattened" ResourceSlice (after following and processing all mixin references)? In the worst case scenario those conversions happen surgically throughout various parts of the k/k codebase, which would make it hard for downstream components like cluster-autoscaler and karpenter to plumb into durable, reusable libraries.
Yeah, I think this is a good question. Every tool that needs to understand the full device definitions will need to flatten the mixins, but the suggested implementation here doesn't lend itself easily to reuse as the flattening happens as part of conversion into the scheduler-specific format.
I added a separate section under "Risks and Mitigations" for this. The most obvious solution here is that we provide a library that handles the flattening, although I'm not sure which types (most likely v1beta2) we should do this for.
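As a very rough sketch of what such a shared helper could look like (all names and types here are hypothetical, and the choice of API version is exactly the open question above), the library might expose something along these lines:

```go
// Hypothetical sketch of a shared flattening library; the package name,
// types, and signatures are placeholders, not a committed API.
package resourceslicemixins

// ResourceSlice stands in for the real API type (v1beta2 or whichever
// version the library ends up targeting).
type ResourceSlice struct{ /* ... */ }

// ResolvedDevice is a device with all mixin references already merged into
// plain attributes and capacities.
type ResolvedDevice struct {
	Name       string
	Attributes map[string]string
	Capacity   map[string]string
}

// Flattener is one possible interface for the shared library, so that the
// scheduler, cluster-autoscaler, kubectl, and other consumers reuse a single
// implementation of the merge rules instead of each reimplementing them.
type Flattener interface {
	ResolveDevices(slice *ResourceSlice) ([]ResolvedDevice, error)
}
```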
cc @towca
I'm almost certain that with the current integration the Cluster Autoscaler is already using the conversion logic from the v1beta1 API to the internal dynamic-resource-allocation version used by the DRA scheduler plugin. So it shouldn't be a concern for that integration.
But for other uses, like implementing a view of the fully flattened devices in kubectl, it is less clear how flattening as part of conversion can be reused.
Great initial thoughts, I would go ahead and move your thinking and preliminary conclusions into the KEP where it will probably get the most engagement.
Yes, this would need ratcheting. We have already gradually moved away from individual per-slice and per-map limits towards aggregating at higher levels. Your proposal is now basically to move this up to the root level of the entire slice. This makes sense to me, and @thockin has approved the previous aggregated API limits, but it still is a bit unusual. Therefore I would like to hear from others what they think about taking this approach to its logical conclusion.
Thanks for splitting these KEPs; this one is now more self-contained and the enhancement is much easier to understand.
### Implementation

The DRA scheduler plugin will flatten the counter sets and devices before
By flattening we still risk that a large number of references combined with large mixins could cause the scheduler to OOM, as there would be no mechanism for keeping memory consumption under control.
I think the in-memory representation should stay as is, but the allocator should iterate over mixins somehow. Is that feasible?
We could do that, but it means additional work to dereference the mixins every time they are needed and more complexity in the allocator to handle it.
I also think the question around memory usage in the scheduler goes beyond just mixins. The memory usage per device does matter, but so does the number of devices. Currently we allow a maximum of 127 devices per ResourceSlice, but the only limit on the number of ResourceSlices is the general limit on the number of objects of a single type in Kubernetes. We can make changes to the allocator to make sure we don't try to keep all devices in memory at the same time, but that comes with other challenges.
We probably should look at whether we should place a limit on the number of devices in a cluster and then see what kind of impact that has on the memory usage of the scheduler.
After taking a look at the code, I do think handling ResourceSlices unflattened in the allocator should be very doable. For counters in either counter sets or device counter consumption, we just need to walk the mixins, and we can probably turn those into an appropriate data structure for that as part of conversion. For device mixins, we do need to flatten them at some point in order to evaluate the CEL expressions for the selectors. But we already do an extra conversion from the internal dynamic-resource-allocation API to the v1beta1 (eventually v1beta2) API for CEL evaluation, and we also prep the attributes and capacity maps for CEL. I think both places can be used to flatten the mixins for CEL evaluation.
So I think we have alternatives if flattening devices in the v1beta1 to dynamic-resource-allocation conversion causes the per-device memory usage to be too high.
As for the total number of devices, it seems like these are managed through an informer in the DRAManager. But I see that we only do conversion for devices that are available for the specific node being processed in the allocator. The question of whether there are enough safeguards to make sure a large number of ResourceSlices doesn't cause the scheduler to OOM might be a question for the GA process of https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters.
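A minimal sketch of what "walking the mixins" for counters could look like is shown below, using simplified placeholder types rather than the real API: the conversion step indexes counter set mixins by name, and counters are then resolved lazily at lookup time instead of being flattened up front. The precedence shown (counters defined directly on the set win, later includes win over earlier ones) is an assumption for illustration only.

```go
package allocator

// Simplified placeholder types; the real API types differ.
type CounterSetMixin struct {
	Name     string
	Counters map[string]int64
}

type CounterSet struct {
	Name     string
	Includes []string         // names of counter set mixins, in order
	Counters map[string]int64 // counters defined directly on the set
}

// mixinIndex is built once during conversion so lookups don't rescan the slice.
type mixinIndex map[string]CounterSetMixin

func buildMixinIndex(mixins []CounterSetMixin) mixinIndex {
	idx := make(mixinIndex, len(mixins))
	for _, m := range mixins {
		idx[m.Name] = m
	}
	return idx
}

// lookupCounter resolves a counter without flattening the whole set:
// counters defined directly on the set are checked first, then the
// referenced mixins are walked in reverse order so that, assuming later
// entries take precedence, the winning value is found first.
func lookupCounter(cs CounterSet, idx mixinIndex, name string) (int64, bool) {
	if v, ok := cs.Counters[name]; ok {
		return v, true
	}
	for i := len(cs.Includes) - 1; i >= 0; i-- {
		if m, ok := idx[cs.Includes[i]]; ok {
			if v, ok := m.Counters[name]; ok {
				return v, true
			}
		}
	}
	return 0, false
}
```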
> we just need to walk the mixins and we can probably turn those into an appropriate data structure for that as part of conversion
Or even better, implement that walking as part of the CEL variable lookup. There's an open TODO. I expect performance to get better because it avoids memory allocations.
Speaking of CEL... there is one aspect of mixins that we hadn't considered. No matter how we specify the limits for the resource slice itself, we now can have more than the traditional 32 attributes/quantities per device. On top of that, looking them up becomes more costly. I think we can ignore the extra lookup cost (cost estimates are just that - estimates), but the higher number of attributes has a real impact on validation because the worst case size of the maps becomes higher, which makes the estimated cost of the same expression higher than it was before. If an expression was just below the cost limit in 1.32, then it might be above the cost limit in 1.33, which breaks users.
We could increase the cost limit in 1.33, but I believe it is impossible to predict by how much. A complex expression might have quadratic costs, on the order of 32 * 32 now and 64 * 64 if we double the number of allowed attributes/quantities, so doubling the cost limit would not be enough.
I don't see a good solution. Perhaps we should make increasing the number of allowed attributes/quantities a non-goal? We only added it because I asked, not because that really was the intent. So perhaps it is not a big loss if we don't do it? If we go down that route, then in addition to whatever aggregate limits we have for a valid slice, we also have to validate that after resolving mixins, the number of attributes/quantities is not higher than what was allowed before.
Yeah, resolving the mixins and validating the number of attributes and capacities during validation seems reasonable. We are already doing most of the work to accommodate this anyway. I like that approach. If there are actual requirements to support a higher number of attributes/capacities for a device, that is a different discussion than mixins.
The only caveat is that if we want to implement kubernetes/kubernetes#130785 as @johnbelamaric brought up in #5238 (comment), we won't be able to validate until allocation time. But quite a bit of validation will have to happen at allocation time if we do this.
@pohly I think the one remaining question from this discussion that hasn't been addressed yet is whether a large number of ResourceSlices in a cluster could cause the scheduler to OOM. Is there already a safeguard for this in the informers used in the DRAManager? And do we need any additional protection to make sure the allocator doesn't convert so many ResourceSlices that it becomes an issue?
I suppose it could have that effect, but I am less worried about that than I am about something in a ResourceClaim causing problems. Admins have to ensure that they run their control plane with enough resources to handle the size and load in their cluster. Installing a DRA driver and configuring it is under the control of the admin. If that causes stability issues, they can uninstall again and retry after granting the kube-scheduler access to more memory.
What we should check is how much memory that might be (your "performance envelope" comment in kubernetes/kubernetes#131198 (comment)).
It's great we don't have to flatten everything.
type ResourceSliceSpec struct {
	...

	// Mixins defines the mixins available for devices and counter sets
Can you expand the comment to clearly define the purpose of mixins and how they will be merged with other attributes (i.e. how possible conflicts are handled)?
This is described in the comments on the Includes fields on the CounterSet, Device, and DeviceCounterConsumption types. I think that is the right place to document this, as the order of the mixins listed matters for conflicting properties.
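To make the ordering point concrete, here is a small sketch of how merging could behave when conflicts occur. The precedence shown (mixins applied in listed order so later entries overwrite earlier ones, with properties defined directly on the device overwriting anything from mixins) is an assumption for illustration; the authoritative rule is whatever the Includes field comments in the KEP specify.

```go
package mixins

// mergeAttributes resolves a device's attributes from its ordered list of
// mixin attribute maps plus the attributes defined directly on the device.
// Assumed rule (illustrative only): mixins are applied in the order they are
// listed, so later mixins overwrite earlier ones on conflict, and the
// device's own attributes overwrite any value coming from a mixin.
func mergeAttributes(mixinAttrs []map[string]string, deviceAttrs map[string]string) map[string]string {
	merged := map[string]string{}
	for _, attrs := range mixinAttrs {
		for k, v := range attrs {
			merged[k] = v // later mixins win on conflict
		}
	}
	for k, v := range deviceAttrs {
		merged[k] = v // device-local properties win over mixins
	}
	return merged
}
```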
Updated the KEP with the proposal for limits in a new section under Design Details. I've left the Partitionable Devices KEP as is so it reflects what was implemented for 1.33. If these changes get accepted and implemented, I will also update the Partitionable Devices KEP.
The KEP has been updated based on the comments.
/approve
/approve
Some non-blocking nits
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dom4ha, johnbelamaric, mortent.