Commit 1a0635c

Addressed more comments
1 parent 2c98d9d commit 1a0635c

File tree

1 file changed: +46 / -25 lines

  • keps/sig-scheduling/5194-reserved-for-workloads

keps/sig-scheduling/5194-reserved-for-workloads/README.md

Lines changed: 46 additions & 25 deletions
@@ -348,24 +348,24 @@ list, it will not add a reference to the pod.
 
 ##### Deallocation
 The resourceclaim controller will remove Pod references from the `ReservedFor` list just
-like it does now using the same logic. For non-Pod references, the controller will recognize
-a small number of built-in types, starting with `Deployment`, `StatefulSet` and `Job`, and will
-remove the reference from the list when those resources are removed. For other types,
-it will be the responsibility of the workload controller/user that created the `ResourceClaim`
-to remove the reference to the non-Pod resource from the `ReservedFor` list when no pods
+like it does now, using the same logic. But for non-Pod references, it
+will be the responsibility of the controller/user that created the `ResourceClaim` to
+remove the reference to the non-Pod resource from the `ReservedFor` list when no pods
 are consuming the `ResourceClaim` and no new pods will be created that reference
 the `ResourceClaim`.
 
 The resourceclaim controller will then discover that the `ReservedFor` list is empty
 and therefore know that it is safe to deallocate the `ResourceClaim`.
 
-This requires that the resourceclaim controller watches the workload types that will
-be supported. For other types of workloads, there will be a requirement that the workload
-controller has permissions to update the status subresource of the `ResourceClaim`. The
-resourceclaim controller will also try to detect if an unknown resource referenced in the
-`ReservedFor` list has been deleted from the cluster, but that requires that the controller
-has permissions to get or list resources of the type. If the resourceclaim controller is
-not able to check, it will just wait until the reference in the `ReservedFor` list is removed.
+This requires that the controller/user has permissions to update the status
+subresource of the `ResourceClaim`. The resourceclaim controller will also try to detect
+whether the resource referenced in the `ReservedFor` list has been deleted from the cluster,
+but that requires that the controller has permissions to get or list resources of the type.
+If the resourceclaim controller is not able to check, it will just wait until the reference
+in the `ReservedFor` list is removed. The resourceclaim controller will not have a watch
+on the workload resource, so there is no guarantee that it will notice that the resource
+has been deleted. This is only an extra safeguard, since it is the responsibility of
+the workload controller to update the claim.
 
 ##### Finding pods using a ResourceClaim
 If the reference in the `ReservedFor` list is to a non-Pod resource, controllers can no longer
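The deallocation contract above reduces to a small status update: the workload controller (or user) clears its own entry from `status.ReservedFor` through the status subresource, and an empty list signals the resourceclaim controller that it may deallocate. A minimal Go sketch, assuming the `resource.k8s.io/v1beta1` types and client-go; the `releaseClaim` helper and its wiring are illustrative, not part of the KEP:

```go
package workload

import (
	"context"

	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// releaseClaim drops the workload's own non-Pod reference from the
// ReservedFor list once no pods consume the claim and no new pods will.
// An empty list is the signal for the resourceclaim controller to deallocate.
func releaseClaim(ctx context.Context, client kubernetes.Interface,
	claim *resourceapi.ResourceClaim, workloadUID types.UID) error {
	kept := make([]resourceapi.ResourceClaimConsumerReference, 0, len(claim.Status.ReservedFor))
	for _, ref := range claim.Status.ReservedFor {
		if ref.UID != workloadUID {
			kept = append(kept, ref) // keep entries owned by someone else
		}
	}
	claim.Status.ReservedFor = kept
	// This is the update that requires RBAC access to the
	// resourceclaims/status subresource mentioned above.
	_, err := client.ResourceV1beta1().ResourceClaims(claim.Namespace).
		UpdateStatus(ctx, claim, metav1.UpdateOptions{})
	return err
}
```

A real controller would also have to handle write conflicts (for example with client-go's `retry.RetryOnConflict`), since the scheduler and kubelet update the same status.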
@@ -473,6 +473,22 @@ The API server will no longer accept the new fields and the other components will
 not know what to do with them. So the result is that the `ReservedFor` list will only
 have references to pod resources like today.
 
+Any ResourceClaims that were already allocated while the feature was active will
+have non-pod references in the `ReservedFor` list after a downgrade, and the controllers
+will not know how to handle them. Two problems arise as a result:
+- The workload controller will also have been downgraded if it is in-tree, meaning that
+it will not remove the reference to the workload resource from the `ReservedFor` list,
+thus leading to a situation where the claim will never be deallocated.
+- For new pods that get scheduled, the scheduler will add pod references to the
+`ReservedFor` list despite the non-pod reference already present. So the list ends up
+with both pod and non-pod references, which breaks the assumptions of the design. We
+need to make sure the system can handle this, as it can also happen as a result of
+disablement and re-enablement of the feature (see the sketch after the version-skew
+section for one way to separate the two kinds of references).
+
+We can address this by adding the logic for the non-pod references in 1.34 and then
+adding the actual feature in 1.35.
 
 ### Version Skew Strategy
 
 If the kubelet is on a version that doesn't support the feature but the rest of the
@@ -482,10 +498,9 @@ since it will still check whether the `Pod` is referenced in the `ReservedFor` list.
 If the API server is on a version that supports the feature, but the scheduler
 is not, the scheduler will not know about the new fields added, so it will
 put the reference to the `Pod` in the `ReservedFor` list rather than the reference
-in the `spec.ReservedFor` list. As a result, the workload will get scheduled, but
-it will be subject to the 256 limit on the size of the `ReservedFor` list and the
-controller creating the `ResourceClaim` will not find the reference it expects
-in the `ReservedFor` list when it tries to remove it.
+in the `spec.ReservedFor` list. It will do this even if there is already a non-pod
+reference in the `spec.ReservedFor` list. This leads to the challenge described
+in the previous section.
 
 ## Production Readiness Review Questionnaire
 
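Both the downgrade scenario and this version-skew scenario can leave a single `ReservedFor` list holding pod and non-pod references at the same time. A hedged sketch of how a component could keep the two apart, relying only on the fact that pod references use the core API group (empty `APIGroup`) and the `pods` resource; the helper name is illustrative:

```go
package workload

import resourceapi "k8s.io/api/resource/v1beta1"

// splitReservedFor partitions a possibly mixed ReservedFor list into pod
// references and non-pod (workload) references so that each kind can be
// handled by the logic that understands it.
func splitReservedFor(refs []resourceapi.ResourceClaimConsumerReference) (pods, workloads []resourceapi.ResourceClaimConsumerReference) {
	for _, ref := range refs {
		if ref.APIGroup == "" && ref.Resource == "pods" {
			pods = append(pods, ref)
		} else {
			workloads = append(workloads, ref)
		}
	}
	return pods, workloads
}
```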
@@ -543,18 +558,21 @@ No
 
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 
-Applications that were already running will continue to run and the allocated
-devices will remain so.
-For the resource types supported directly, the resource claim controller will not remove the
-reference in the `ReservedFor` list, meaning the devices will not be deallocated. If the workload
-controller is responsible for removing the reference, deallocation will work as long as the
-feature isn't also disabled in the controllers. If they are, deallocation will not happen in this
-situation either.
+Applications that were already running will continue to run. But if a pod has to be
+re-admitted by a kubelet where the feature has been disabled, re-admission will fail,
+since the kubelet will not find a reference to the pod in the `ReservedFor` list.
+
+The feature will also be disabled for in-tree workload controllers, meaning that they will
+not remove the reference to the workload from the `ReservedFor` list. This means the list
+will never be empty and the resourceclaim controller will never deallocate the claim.
 
 ###### What happens if we reenable the feature if it was previously rolled back?
 
 It will take effect again and will impact how the `ReservedFor` field is used during allocation
-and deallocation.
+and deallocation. Since this scenario allows the `spec.ReservedFor` field of a ResourceClaim
+to be set and then lets the scheduler populate the `ReservedFor` list with pods while the
+feature is disabled, we can end up in a situation where the `ReservedFor` list contains both
+non-pod and pod references. We need to make sure all components can handle that.
 
 ###### Are there any tests for feature enablement/disablement?
 
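The re-admission failure described above follows directly from the check the kubelet performs: with the feature disabled, it only knows to look for the pod's own UID in the list. A rough illustration of that check, mirroring the described behavior rather than the kubelet's actual source:

```go
package workload

import (
	corev1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
)

// podReservedInClaim reports whether the pod itself appears in the claim's
// ReservedFor list. A claim reserved for a workload object instead of its
// individual pods fails this check, so re-admission is refused.
func podReservedInClaim(pod *corev1.Pod, claim *resourceapi.ResourceClaim) bool {
	for _, ref := range claim.Status.ReservedFor {
		if ref.APIGroup == "" && ref.Resource == "pods" && ref.UID == pod.UID {
			return true
		}
	}
	return false
}
```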
@@ -723,7 +741,10 @@ No
 
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 
-No
+Yes and no. We are adding two new fields to the ResourceClaim type, but neither is of a collection type,
+so they should have limited impact on the total size of the objects. However, this feature means that
+we no longer need to keep a complete list of all pods using a ResourceClaim, which can significantly
+reduce the size of ResourceClaim objects shared by many pods.
 
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
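To make the size trade-off concrete: today each consuming pod costs one entry in `status.ReservedFor` (capped at 256 entries), while the proposal replaces that with a fixed-size reference in the spec. A purely illustrative sketch of what one of the new fields could look like; the actual names and shape are defined by the KEP, not this snippet:

```go
package workload

import resourceapi "k8s.io/api/resource/v1beta1"

// Illustrative only: a spec-level reference to the consuming workload.
// The claim carries one such reference regardless of how many pods the
// workload runs, instead of one status entry per pod.
type ResourceClaimSpecSketch struct {
	// ... existing spec fields elided ...

	// ReservedFor, if set, names the workload object whose pods may use
	// this claim; the scheduler then skips adding per-pod entries.
	ReservedFor *resourceapi.ResourceClaimConsumerReference `json:"reservedFor,omitempty"`
}
```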