KEP-4815+5234: DRA Update 4815 and split out 5234 for mixins #5238
Conversation
mortent commented Apr 11, 2025
- One-line PR description: Update PartitionableDevices KEP and create separate ResourceSliceMixins KEP
- Issue link: DRA: ResourceSlice Mixins #5234
- Issue link: DRA: Partitionable Devices #4815
- Other comments: This updates 4815 to reflect the functionality that was actually implemented for 1.33. The mixins feature was originally part of 4815, but got cut from the scope for 1.33. So this moves that functionality into a separate KEP, 5234.
/wg device-management
@bg-furiosa: changing LGTM is restricted to collaborators.
/lgtm
New changes are detected. LGTM label has been removed.
be reduced and a larger number of devices can be published within a single
ResourceSlice.
- Enable defining devices with more attributes, capacities, and consumed counters.
- Enable defining counter sets with more counters.
Two things we should consider (not sure if these are goals or implementation choices): 1) enabling mixins to be per-pool, not per-ResourceSlice; 2) enabling counters to be per-pool, not per-ResourceSlice.
If necessary, these could be considered in beta. But I do think we're going to need them in time. The second may belong in the Partitionable Devices KEP, not here.
Yeah, we have an issue for that in kubernetes/kubernetes#130785. We definitely need to make a decision on this in this cycle, as I think changing this must happen over two releases. I was hoping to handle this separately from this KEP, but open to including it here.
This KEP adds a few new limits on the size of slices/maps in the ResourceSlice, in addition to the ones that were added as part of the Partitionable Devices and other features. But as I've tried a few scenarios, I've realized the result is not great, as it makes it possible to add more attributes, capacities, and counters through mixins without actually reducing the storage size of the ResourceSlice.

An example is that we currently limit the total number of counters across the counter sets in a ResourceSlice to 32. As a result, it is impossible to create a counter set with more than 32 counters. But with mixins, I can create a counter set mixin that is only referenced from a single counter set to create larger counter sets. This doesn't reduce the number of counters defined in a ResourceSlice, it just forces users to "abuse" mixins to bypass the limit.

We should set the limits based on the total number of attributes, capacities, and counters across the ResourceSlice, rather than based on whether they are defined in a Device or a Mixin. I suggest we set the limits to something like:
So there are no special limits for mixins; they count against the same limits as the properties defined in devices. With these limits, the worst-case size for the ResourceSlice increases from 1,107,864 bytes to 1,288,825 bytes as a result of adding mixins. I think changing the limits for the counters should be pretty straightforward since those only affect fields that are still in alpha. So we can add the new ResourceSlice-wide limits and remove the more granular ones.
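To illustrate the direction (not the final numbers), a slice-wide check could look roughly like the sketch below. The types are simplified stand-ins for the real resource.k8s.io API and the limit constants are placeholders, since the proposed values are still under discussion; the point is only that everything counts against one ResourceSlice-wide budget regardless of whether it is defined on a device or in a mixin.

```go
package validation

import "fmt"

// Simplified stand-ins for the real resource.k8s.io types (illustrative only).
type ResourceSlice struct {
	Devices      []Device
	DeviceMixins []DeviceMixin
	CounterSets  []CounterSet
}

type Device struct {
	Attributes map[string]string
	Capacity   map[string]string
}

type DeviceMixin struct {
	Attributes map[string]string
	Capacity   map[string]string
}

type CounterSet struct {
	Counters map[string]string
}

// Hypothetical slice-wide budgets; the actual numbers are part of the open discussion above.
const (
	maxAttributesAndCapacitiesPerSlice = 4096
	maxCountersPerSlice                = 1024
)

// validateSliceLimits counts properties across devices, mixins, and counter
// sets and checks the totals against ResourceSlice-wide limits, instead of
// applying separate limits depending on where the properties are defined.
func validateSliceLimits(slice ResourceSlice) error {
	attrsAndCaps := 0
	for _, d := range slice.Devices {
		attrsAndCaps += len(d.Attributes) + len(d.Capacity)
	}
	for _, m := range slice.DeviceMixins {
		attrsAndCaps += len(m.Attributes) + len(m.Capacity)
	}
	if attrsAndCaps > maxAttributesAndCapacitiesPerSlice {
		return fmt.Errorf("too many attributes and capacities in ResourceSlice: %d > %d",
			attrsAndCaps, maxAttributesAndCapacitiesPerSlice)
	}

	counters := 0
	for _, cs := range slice.CounterSets {
		counters += len(cs.Counters)
	}
	if counters > maxCountersPerSlice {
		return fmt.Errorf("too many counters in ResourceSlice: %d > %d",
			counters, maxCountersPerSlice)
	}
	return nil
}
```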
### Implementation

The DRA scheduler plugin will flatten the counter sets and devices before
going through the allocation process. This will happen as part of conversion
Putting my SIG Autoscaling hat on, is this a sufficient description of the plan to provide a standard interface for rendering the "flattened" ResourceSlice (after following and processing all mixin references)? In the worst case scenario those conversions happen surgically throughout various parts of the k/k codebase, which would make it hard for downstream components like cluster-autoscaler and karpenter to plumb into durable, reusable libraries.
Yeah, I think this is a good question. Every tool that needs to understand the full device definitions will need to flatten the mixins, but the suggested implementation here doesn't lend itself easily to reuse as the flattening happens as part of conversion into the scheduler-specific format.
I added a separate section under "Risks and Mitigations" for this. The most obvious solution here is that we provide a library that handles the flattening, although I'm not sure which types (most likely v1beta2) we should do this for.
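As a very rough sketch of what such a shared helper could look like (all names and types here are hypothetical, and the choice of API version is exactly the open question above), the library might expose something along these lines:

```go
// Hypothetical sketch of a shared flattening library; the package name,
// types, and signatures are placeholders, not a committed API.
package resourceslicemixins

// ResourceSlice stands in for the real API type (v1beta2 or whichever
// version the library ends up targeting).
type ResourceSlice struct{ /* ... */ }

// ResolvedDevice is a device with all mixin references already merged into
// plain attributes and capacities.
type ResolvedDevice struct {
	Name       string
	Attributes map[string]string
	Capacity   map[string]string
}

// Flattener is one possible interface for the shared library, so that the
// scheduler, cluster-autoscaler, kubectl, and other consumers reuse a single
// implementation of the merge rules instead of each reimplementing them.
type Flattener interface {
	ResolveDevices(slice *ResourceSlice) ([]ResolvedDevice, error)
}
```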
cc @towca
I'm almost certain that with the current integration the Cluster Autoscaler is already using the conversion logic from the v1beta1 API to the internal dynamic-resource-allocation version used by the DRA scheduler plugin. So it shouldn't be a concern for that integration.
But for other uses, like implementing a view of the fully flattened devices in kubectl, it is less clear how flattening as part of conversion can be reused.
Great initial thoughts, I would go ahead and move your thinking and preliminary conclusions into the KEP where it will probably get the most engagement.
Yes, this would need ratcheting. We have already gradually moved away from individual per-slice and per-map limits towards aggregating at higher levels. Your proposal is now basically to move this up to the root level of the entire slice. This makes sense to me, and @thockin has approved the previous aggregated API limits, but it still is a bit unusual. Therefore I would like to hear from others what they think about taking this approach to its logical conclusion.
Thanks for splitting these KEPs; this one is now more self-contained and the enhancement is much easier to understand.
### Implementation

The DRA scheduler plugin will flatten the counter sets and devices before
By flattening we still risk that a large number of references combined with large mixins could cause the scheduler to OOM, as there would be no mechanism for keeping memory consumption under control.
I think the in-memory representation should stay as is, but the allocator should iterate over mixins somehow. Is that feasible?
We could do that, but it means additional work to dereference the mixins every time they are needed and more complexity in the allocator to handle it.
I also think the question around memory usage in the scheduler goes beyond just mixins. The memory usage per device does matter, but so does the number of devices. Currently we allow a maximum of 127 devices per ResourceSlice, but the only limit on the number of ResourceSlices is the general limit on the number of objects of a single type in Kubernetes. We can make changes to the allocator to make sure we don't try to keep all devices in memory at the same time, but that comes with other challenges.
We probably should look at whether we should place a limit on the number of devices in a cluster and then see what kind of impact that has on the memory usage of the scheduler.
After taking a look at the code, I do think handling ResourceSlices unflattened in the allocator should be very doable. For counters in either counter sets or device counter consumption, we just need to walk the mixins, and we can probably turn those into an appropriate data structure for that as part of conversion. For device mixins, we do need to flatten them at some point in order to evaluate the CEL expressions for the selectors. But we already do an extra conversion from the internal dynamic-resource-allocation API to the v1beta1 (eventually v1beta2) API for CEL evaluation, and we also prep the attributes and capacity maps for CEL. I think both places can be used to flatten the mixins for CEL evaluation.
So I think we have alternatives if flattening devices in the v1beta1 to dynamic-resource-allocation conversion causes the per-device memory usage to be too high.
As for the total number of devices, it seems like these are managed through an informer in the DRAManager. But I see that we only do conversion for devices that are available for the specific node being processed in the allocator. The question of whether there are enough safeguards to make sure a large number of ResourceSlices doesn't cause the scheduler to OOM might be a question for the GA process of https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters.
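A minimal sketch of what "walking the mixins" for counters could look like is shown below, using simplified placeholder types rather than the real API: the conversion step indexes counter set mixins by name, and counters are then resolved lazily at lookup time instead of being flattened up front. The precedence shown (counters defined directly on the set win, later includes win over earlier ones) is an assumption for illustration only.

```go
package allocator

// Simplified placeholder types; the real API types differ.
type CounterSetMixin struct {
	Name     string
	Counters map[string]int64
}

type CounterSet struct {
	Name     string
	Includes []string         // names of counter set mixins, in order
	Counters map[string]int64 // counters defined directly on the set
}

// mixinIndex is built once during conversion so lookups don't rescan the slice.
type mixinIndex map[string]CounterSetMixin

func buildMixinIndex(mixins []CounterSetMixin) mixinIndex {
	idx := make(mixinIndex, len(mixins))
	for _, m := range mixins {
		idx[m.Name] = m
	}
	return idx
}

// lookupCounter resolves a counter without flattening the whole set:
// counters defined directly on the set are checked first, then the
// referenced mixins are walked in reverse order so that, assuming later
// entries take precedence, the winning value is found first.
func lookupCounter(cs CounterSet, idx mixinIndex, name string) (int64, bool) {
	if v, ok := cs.Counters[name]; ok {
		return v, true
	}
	for i := len(cs.Includes) - 1; i >= 0; i-- {
		if m, ok := idx[cs.Includes[i]]; ok {
			if v, ok := m.Counters[name]; ok {
				return v, true
			}
		}
	}
	return 0, false
}
```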
> we just need to walk the mixins and we can probably turn those into an appropriate data structure for that as part of conversion
Or even better, implement that walking as part of the CEL variable lookup. There's an open TODO. I expect performance to get better because it avoids memory allocations.
Speaking of CEL... there is one aspect of mixins that we hadn't considered. No matter how we specify the limits for the resource slice itself, we now can have more than the traditional 32 attributes/quantities per device. On top of that, looking them up becomes more costly. I think we can ignore the extra lookup cost (cost estimates are just that - estimates), but the higher number of attributes has a real impact on validation because the worst case size of the maps becomes higher, which makes the estimated cost of the same expression higher than it was before. If an expression was just below the cost limit in 1.32, then it might be above the cost limit in 1.33, which breaks users.
We could increase the cost limit in 1.33, but I believe it is impossible to predict by how much. A complex expression might have quadratic costs, on the order of 32 * 32 now and 64 * 64 if we double the number of allowed attributes/quantities, so doubling the cost limit would not be enough.
I don't see a good solution. Perhaps we should make increasing the number of allowed attributes/quantities a non-goal? We only added it because I asked, not because that really was the intent. So perhaps it is not a big loss if we don't do it? If we go down that route, then in addition to whatever aggregate limits we have for a valid slice, we also have to validate that after resolving mixins, the number of attributes/quantities is not higher than what was allowed before.
Yeah, resolving the mixins and validating the number of attributes and capacities during validation seems reasonable. We are already doing most of the work to accommodate this anyway. I like that approach. If there are actual requirements to support a higher number of attributes/capacities for a device, that is a different discussion than mixins.
The only caveat is that if we want to implement kubernetes/kubernetes#130785 as @johnbelamaric brought up in #5238 (comment), we won't be able to validate until allocation time. But quite a bit of validation will have to happen at allocation time if we do this.
@pohly I think the one remaining question from this discussion that hasn't been addressed yet is whether a large number of ResourceSlices in a cluster could cause the scheduler to OOM. Is there already a safeguard for this in the informers used in the DRAManager? And do we need any additional protection to make sure the allocator doesn't convert so many ResourceSlices that it becomes an issue?
I suppose it could have that effect, but I am less worried about that than I am about something in a ResourceClaim causing problems. Admins have to ensure that they run their control plane with enough resources to handle the size and load in their cluster. Installing a DRA driver and configuring it is under the control of the admin. If that causes stability issues, they can uninstall again and retry after granting the kube-scheduler access to more memory.
What we should check is how much memory that might be (your "performance envelope" comment in kubernetes/kubernetes#131198 (comment)).
It's great we don't have to flatten everything.
type ResourceSliceSpec struct {
	...

	// Mixins defines the mixins available for devices and counter sets
Can you expand the comment to clearly define the purpose of mixins and how they will be merged with other attributes (i.e. how possible conflicts are handled)?
This is described in the comments on the Includes fields on the CounterSet, Device, and DeviceCounterConsumption types. I think that is the right place to document this, as the order of the mixins listed matters for conflicting properties.
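To make the ordering point concrete, here is a small sketch of how merging could behave when conflicts occur. The precedence shown (mixins applied in listed order so later entries overwrite earlier ones, with properties defined directly on the device overwriting anything from mixins) is an assumption for illustration; the authoritative rule is whatever the Includes field comments in the KEP specify.

```go
package mixins

// mergeAttributes resolves a device's attributes from its ordered list of
// mixin attribute maps plus the attributes defined directly on the device.
// Assumed rule (illustrative only): mixins are applied in the order they are
// listed, so later mixins overwrite earlier ones on conflict, and the
// device's own attributes overwrite any value coming from a mixin.
func mergeAttributes(mixinAttrs []map[string]string, deviceAttrs map[string]string) map[string]string {
	merged := map[string]string{}
	for _, attrs := range mixinAttrs {
		for k, v := range attrs {
			merged[k] = v // later mixins win on conflict
		}
	}
	for k, v := range deviceAttrs {
		merged[k] = v // device-local properties win over mixins
	}
	return merged
}
```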
Updated the KEP with the proposal for limits in a new section under Design Details. I've left the Partitionable Devices KEP as is so it reflects what was implemented for 1.33. If these changes get accepted and implemented, I will also update the Partitionable Devices KEP.
The KEP has been updated based on the comments.
/approve
/approve
Some non-blocking nits
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dom4ha, johnbelamaric, mortent.