From 9272ef0b4fb7b58195add0b543daa2645dee50ad Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Wed, 16 Apr 2025 15:15:07 +0000 Subject: [PATCH] KEP-5229: Asynchronous API calls during scheduling --- keps/prod-readiness/sig-scheduling/5229.yaml | 3 + .../README.md | 961 ++++++++++++++++++ .../kep.yaml | 28 + 3 files changed, 992 insertions(+) create mode 100644 keps/prod-readiness/sig-scheduling/5229.yaml create mode 100644 keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/README.md create mode 100644 keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/kep.yaml diff --git a/keps/prod-readiness/sig-scheduling/5229.yaml b/keps/prod-readiness/sig-scheduling/5229.yaml new file mode 100644 index 00000000000..5e828592a0b --- /dev/null +++ b/keps/prod-readiness/sig-scheduling/5229.yaml @@ -0,0 +1,3 @@ +kep-number: 5229 +alpha: + approver: "" diff --git a/keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/README.md b/keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/README.md new file mode 100644 index 00000000000..987776fce7d --- /dev/null +++ b/keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/README.md @@ -0,0 +1,961 @@ +# KEP-5229: Asynchronous API calls during scheduling + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [1: Where and how to handle API calls in the kube-scheduler](#1-where-and-how-to-handle-api-calls-in-the-kube-scheduler) + - [1.1: Handle API calls in the scheduling queue](#11-handle-api-calls-in-the-scheduling-queue) + - [1.2: Handle API calls in the handleSchedulingFailure](#12-handle-api-calls-in-the-handleschedulingfailure) + - [1.3: Use advanced queue and don't block the pod from being scheduled in the meantime](#13-use-advanced-queue-and-dont-block-the-pod-from-being-scheduled-in-the-meantime) + - [2: How to make the API calls asynchronous](#2-how-to-make-the-api-calls-asynchronous) + - [2.1: Just dispatch goroutines](#21-just-dispatch-goroutines) + - [2.2: Make the API calls queued](#22-make-the-api-calls-queued) + - [2.3: Send API calls through a kube-scheduler's cache](#23-send-api-calls-through-a-kube-schedulers-cache) + - [Another things worth considering](#another-things-worth-considering) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + 
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+This KEP proposes making all API calls during scheduling asynchronous by introducing a new kube-scheduler-wide way of handling such calls.
+
+## Motivation
+
+Scheduling performance is crucial. One of the bottlenecks is the API calls made during the scheduling cycle.
+The binding cycle is already asynchronous, but it is still worth re-evaluating whether the current model of busy-waiting goroutines is good long-term.
+The following operations involve pod-based API calls during scheduling:
+1) Updating a Pod status in `handleSchedulingFailure` when a Pod is unschedulable.
+2) Preemption - `ClearNominatedNodeName` and pod eviction are already asynchronous with KEP-4832.
+3) Pod binding - already runs in a goroutine, but could still be considered.
+4) [Feature proposal: https://github.com/kubernetes/kubernetes/issues/130668] Updating the status of a Pod that is rejected by the `PreEnqueue` plugins in the scheduling queue.
+5) [Feature proposal] Setting `nominatedNodeName` in delayed binding scenarios.
+
+In-tree plugins' operations that involve API calls during scheduling:
+6) Volume binding.
+7) DRA ResourceClaim deallocation in `PostFilter`.
+8) DRA removing `ReservedFor` in `Unreserve`.
+9) DRA ResourceClaims binding.
+These could be considered for asynchronous handling as well, but not necessarily.
+
+Introducing one universal approach to handling API calls in the kube-scheduler would make these calls consistent and allow better control over
+the number of dispatched goroutines. Asynchronous preemption could also be migrated to this approach.
+
+### Goals
+
+- Introduce a new asynchronous way of making API calls in the kube-scheduler.
+- Replace the pod update API call with an asynchronous version.
+- Make it possible to update a pod to set the `PreEnqueue` status asynchronously.
+
+### Non-Goals
+
+
+
+## Proposal
+
+
+
+There are a few ways to make API calls asynchronous.
+They are introduced below to facilitate discussion and identify the most suitable solution.
+
+These questions have to be answered:
+1) Where and how to handle pod status updates during queueing and scheduling.
+2) How to make the API calls asynchronous.
+
+Also, races (collisions) between multiple API calls for a single pod should be mitigated by the design.
+
+### 1: Where and how to handle API calls in the kube-scheduler
+
+There are multiple possible ways to handle the API calls, especially the pod status updates.
+Other (potential) use cases should also be considered when choosing the solution.
+Three ways are presented below.
+
+#### 1.1: Handle API calls in the scheduling queue
+
+One possible approach is to send the API calls through the scheduling queue.
+This allows delaying putting the pod into `unschedulablePods` until the pod update completes.
+This prevents race conditions from parallel updates of a single pod because, during the API call,
+the pod is in-flight and thus not eligible for rescheduling.
+
+A new method could be added to the `PriorityQueue`, which would take the function to be called asynchronously.
+It should also make sure the pod is stored in `inFlightPods` to register the cluster events that happen during the asynchronous part.
+Calling `AddUnschedulableIfNotPresent` at the end ensures there won't be any race with the asynchronous pod update.
+Because the pod would need to be in `inFlightPods` during the API call, the size of `inFlightEvents` might increase,
+but as long as the API call executes quickly, there won't be a significant memory issue.
+
+An example solution could look like this:
+
+```go
+// Author: @sanposhiho
+func (p *PriorityQueue) AddUnschedulableAsync(pInfo *framework.QueuedPodInfo, fn func() error) {
+	// Make sure the Pod is in inFlightPods before starting the goroutine
+
+	go func() { // Or another way of dispatching
+		// Run fn first
+		if err := fn(); err != nil { ... }
+
+		// Push the pod back to the unschedQ after completing fn().
+		p.AddUnschedulableIfNotPresent(...)
+	}()
+}
+```
+
+This way, we could cover pod status updates during the failure handler (1) and pod status updates for `PreEnqueue` plugins (4).
+Asynchronous preemption (2) could be migrated to this approach by adding the possibility to return a function from `PostFilter` plugins in `PostFilterResult`
+and calling this function, probably in the failure handler, together with the status update.
+
+However, this method cannot be used for the `nominatedNodeName` scenario (5) because this operation also occurs when scheduling succeeds.
+Therefore, additional effort would have to be made to specifically ensure that the `nominatedNodeName` update doesn't collide with a potential status update.
+Probably, before the status update in the failure handler, the code should try to cancel the `nominatedNodeName` API call or wait until it finishes.
+After that, it should proceed with setting the unschedulable status via the API. The binding call might similarly need to wait.
+
+Another aspect to consider is how to dispatch the goroutines, as discussed in the [how to make the API calls asynchronous](#2-how-to-make-the-api-calls-asynchronous) section.
+
+Pros:
+- Allows delaying putting unschedulable pods back to the queue until the API update completes.
+- Prevents race conditions for parallel updates of a single pod by delaying the `AddUnschedulableIfNotPresent` call.
+- Can easily cover status updates for both scheduling failures and `PreEnqueue` failures.
+- Asynchronous preemption could be migrated to this approach, increasing consistency.
+
+Cons:
+- Handling of failures might not be consistent, requiring `AddUnschedulableAsync` to be called in two places.
+- Delaying the `AddUnschedulableIfNotPresent` call increases pod queuing latency because the initial backoff timestamp is set there.
+- Cannot be used for the `nominatedNodeName` scenario, requiring additional effort and separate handling.
+- Might visibly increase the size of `inFlightEvents` if API calls are slow or if there are many calls.
+
+
+#### 1.2: Handle API calls in the handleSchedulingFailure
+
+Another approach could be to make all unschedulable status update API calls within `handleSchedulingFailure`.
+This would make this handler the only error-reporting path. Synchronous API calls within this handler could be made asynchronous,
+but additional effort would be needed to prevent race conditions. This could be achieved by blocking the retries of the pod using `PreEnqueue`
+(similar to asynchronous preemption) or by implementing advanced queueing logic.
+A minimal sketch of this approach is shown at the end of this subsection.
+
+This way, again, we could cover pod status updates during the failure handler (1),
+but pod status updates for `PreEnqueue` plugins (4) will require more refactoring by either:
+- Running a simplified scheduling cycle for pods that were rejected by the `PreEnqueue` plugins to update the pod condition.
+  This might negatively impact scheduling performance because a portion of the scheduling cycles will be spent on pods that are ultimately unschedulable.
+  Moreover, `PreEnqueue` plugins might also need to be called within this simplified scheduling cycle,
+  or alternatively, `PreFilter` plugins could implement the necessary PreEnqueue logic, duplicating it.
+- Calling `handleSchedulingFailure` directly from the scheduling queue when a pod is rejected by the `PreEnqueue` plugins.
+  This might be feasible, although it would create a circular dependency between the scheduling queue and the handler;
+  however, it wouldn't have the same performance implications as the solution above.
+
+Asynchronous preemption could also be migrated to this approach by exposing a function,
+provided that the blocking behavior in `PreEnqueue` is consistent with the actual preemption blocking mechanism.
+
+Again, this method cannot be used for the `nominatedNodeName` scenario (5) because this operation also occurs when scheduling succeeds.
+Therefore, additional effort would have to be made to specifically ensure that the `nominatedNodeName` update doesn't collide with a potential status update.
+
+Pros:
+- Makes the failure handler the single path of reporting unschedulable status errors.
+- Asynchronous preemption could potentially be migrated to this approach, increasing consistency.
+- The pod would be immediately put back into the scheduling queue, starting the backoff timer right away.
+
+Cons:
+- Requires additional effort to prevent race conditions for updates.
+- Handling PreEnqueue rejections requires significant refactoring (implementing a simplified scheduling cycle or a direct `handleSchedulingFailure` call).
+  - Simplified scheduling cycle for `PreEnqueue` rejections could impact performance and duplicate `PreEnqueue` logic.
+  - Direct `handleSchedulingFailure` call would introduce a circular dependency.
+- Cannot be used for the `nominatedNodeName` scenario, requiring additional effort and separate handling.
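+
+To make the discussion more concrete, below is a minimal, self-contained sketch of this approach.
+All names (`asyncStatusUpdater`, the simplified `handleSchedulingFailure` signature, `PreEnqueueCheck`) are hypothetical and only
+illustrate the gate-then-dispatch pattern; they are not the actual kube-scheduler APIs.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync"
+	"time"
+)
+
+// asyncStatusUpdater tracks pods that have an in-flight status update API call (hypothetical type).
+type asyncStatusUpdater struct {
+	mu      sync.Mutex
+	pending map[string]bool // keyed by pod UID
+}
+
+// block records that a pod has an in-flight API call.
+func (u *asyncStatusUpdater) block(uid string) {
+	u.mu.Lock()
+	defer u.mu.Unlock()
+	u.pending[uid] = true
+}
+
+// unblock allows the pod to be retried again.
+func (u *asyncStatusUpdater) unblock(uid string) {
+	u.mu.Lock()
+	defer u.mu.Unlock()
+	delete(u.pending, uid)
+}
+
+// PreEnqueueCheck would be consulted by the scheduling queue, playing the same
+// role as the blocking used by asynchronous preemption today.
+func (u *asyncStatusUpdater) PreEnqueueCheck(uid string) bool {
+	u.mu.Lock()
+	defer u.mu.Unlock()
+	return !u.pending[uid]
+}
+
+// handleSchedulingFailure sketches the single error-reporting path: the status
+// update is dispatched asynchronously, and the pod is gated from retries only
+// until the call completes, so its backoff timer can start right away.
+func handleSchedulingFailure(u *asyncStatusUpdater, uid string, updateStatus func() error) {
+	u.block(uid)
+	go func() {
+		defer u.unblock(uid)
+		if err := updateStatus(); err != nil {
+			// How to surface asynchronous API errors is an open question in this KEP.
+			fmt.Println("async status update failed:", err)
+		}
+	}()
+	// In the real scheduler, AddUnschedulableIfNotPresent would be called here.
+}
+
+func main() {
+	u := &asyncStatusUpdater{pending: map[string]bool{}}
+	handleSchedulingFailure(u, "pod-1", func() error {
+		time.Sleep(10 * time.Millisecond) // stands in for the API call latency
+		return nil
+	})
+	fmt.Println("retry allowed right after failure:", u.PreEnqueueCheck("pod-1"))    // false
+	time.Sleep(20 * time.Millisecond)
+	fmt.Println("retry allowed once the update finished:", u.PreEnqueueCheck("pod-1")) // true
+}
+```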
+
+
+#### 1.3: Use advanced queue and don't block the pod from being scheduled in the meantime
+
+A third approach could involve allowing the pod to enter the scheduling queue and be scheduled again even before the status update API call completes, without blocking it.
+This would require implementing advanced logic for queueing API calls in the kube-scheduler and migrating **all** pod-based API calls done during scheduling to this method,
+potentially including the binding API call. The new component should be able to resolve any conflicts in the incoming API calls as well as parallelize them properly,
+e.g., not running two updates of the same pod in parallel. This requires [making the API calls queued](#22-make-the-api-calls-queued) or
+[sending API calls through a kube-scheduler's cache](#23-send-api-calls-through-a-kube-schedulers-cache) to be implemented.
+
+All pod-based scenarios (1 - 5) could and should be implemented when choosing this approach.
+Still, a single error-reporting path for pod condition updates could be considered but wouldn't be required.
+
+Pros:
+- Allows the pod to be scheduled again even before the API call completes.
+- Simplifies introducing new API calls to the kube-scheduler if the collision handling logic is configured correctly.
+
+Cons:
+- Requires implementing complex, advanced queueing logic.
+- Necessitates migrating **all** pod-based API calls to this method.
+- Implementing collision resolution (e.g., for same-pod updates) is complex.
+
+
+### 2: How to make the API calls asynchronous
+
+Another thing worth considering is how to actually make the API calls asynchronous.
+
+#### 2.1: Just dispatch goroutines
+
+With appropriate handling of races during updates, we could just dispatch goroutines with API calls.
+A potential drawback is that we won't limit the number of these goroutines and won't be able to, e.g., delay the calls.
+Limiting goroutines could still be easily achieved by using a worker pool with a limited number of goroutines and a simple queue that stores pending calls (see the sketch after the lists below).
+Some delay might still appear as a side effect, especially when there are problems with the kube-apiserver,
+so some higher-level mechanism such as (1.1) or (1.2) would be needed to prevent pod update races.
+
+Pros:
+- Simple to implement if the appropriate race handling is chosen.
+- Can easily be extended with a simple queue and worker pool to limit the number of goroutines.
+
+Cons:
+- Does not inherently support delaying calls.
+- Higher-level mechanisms (like 1.1 or 1.2) would be needed to prevent pod update races.
+- `nominatedNodeName` scenario support would require more effort in (1.1) or (1.2).
+
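+
+A minimal sketch of such a worker pool is shown below. It is purely illustrative and assumes nothing about the
+kube-scheduler's internals; the `apiCallDispatcher` name and its methods are hypothetical, and any bounded-concurrency
+primitive would do.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync"
+)
+
+// apiCallDispatcher queues pending calls and executes them with a bounded
+// number of worker goroutines.
+type apiCallDispatcher struct {
+	calls chan func() error
+	wg    sync.WaitGroup
+}
+
+func newAPICallDispatcher(workers, queueSize int) *apiCallDispatcher {
+	d := &apiCallDispatcher{calls: make(chan func() error, queueSize)}
+	for i := 0; i < workers; i++ {
+		d.wg.Add(1)
+		go func() {
+			defer d.wg.Done()
+			for call := range d.calls {
+				if err := call(); err != nil {
+					// Error handling for asynchronous calls is an open question.
+					fmt.Println("async API call failed:", err)
+				}
+			}
+		}()
+	}
+	return d
+}
+
+// Dispatch enqueues a call; it blocks only if the queue is full.
+func (d *apiCallDispatcher) Dispatch(call func() error) {
+	d.calls <- call
+}
+
+// Close stops accepting calls and waits for the in-flight ones to finish.
+func (d *apiCallDispatcher) Close() {
+	close(d.calls)
+	d.wg.Wait()
+}
+
+func main() {
+	d := newAPICallDispatcher(2 /* workers */, 100 /* queue size */)
+	for i := 0; i < 5; i++ {
+		i := i
+		d.Dispatch(func() error {
+			fmt.Println("updating pod status", i) // stands in for a real API call
+			return nil
+		})
+	}
+	d.Close()
+}
+```
+
+Such a pool bounds the number of goroutines, but it has no insight into what the calls actually do; adding that
+understanding is what the queueing approach below provides.
+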
+
+#### 2.2: Make the API calls queued
+
+To make asynchronous dispatching more advanced, a queueing approach could be explored.
+A queue might understand what the API calls are intended to do and eventually delay, skip, or merge them,
+e.g., not setting `nominatedNodeName` when pod binding is already enqueued.
+Initially, it could be a framework, which might be extended in the future, e.g., by introducing the possibility of setting delays.
+
+However, it is an open question what should happen if two update API calls for the same pod are enqueued.
+This might not happen in (1.1) and (1.2) if we wait for the previous status update call to complete or terminate it.
+Otherwise, since the update is currently done on a copy of the pod, the two might collide. If the update were to be done on the original pod object,
+it might be possible to simply decide which API calls should be applied for a pod:
+- Status update (patch): Apply the newest API call
+- Binding: Ignore status update API calls
+- Delete (in preemption): Ignore status update as well as binding API calls
+
+```go
+type APICallType string
+
+const (
+	StatusUpdate APICallType = "status_update"
+	Binding      APICallType = "binding"
+	Delete       APICallType = "delete"
+)
+
+type PodAPICall struct {
+	podID    types.UID
+	callType APICallType
+	fn       func()
+}
+
+type APIQueue struct {
+	...
+}
+
+func (aq *APIQueue) Add(podAPICall PodAPICall) {
+	// If an API call for a specific podID is already enqueued,
+	// check the callType and skip or replace the call depending on precedence.
+	...
+}
+
+func (aq *APIQueue) Run() {
+	// Dispatch a limited number of goroutines if the queue is non-empty.
+	...
+}
+```
+
+Pros:
+- Allows for advanced goroutine dispatching logic.
+- Can potentially delay, skip, or merge API calls based on type (e.g., skip `nominatedNodeName` if binding is pending).
+- All collisions could be resolved at the queue level, not relying on higher-level mechanisms (like 1.1 or 1.2).
+- Allows for (1.3), where all scenarios can be supported without additional structures.
+- Provides a framework that can be extended in the future.
+
+Cons:
+- Requires complex logic to handle potential conflicts between different update types for the same pod.
+- Needs a clear strategy for how to update the in-memory pod object during scheduling.
+
+
+#### 2.3: Send API calls through a kube-scheduler's cache
+
+A third approach could be to have a consistent pod state in the kube-scheduler itself first and then change it through the API.
+This means that all API calls would have to go through the kube-scheduler's cache, change the pod there, and only then execute.
+However, pod updates might come from outside the kube-scheduler, e.g., a user changes the spec or something changes the status (if that is even possible).
+This extended cache would have to merge the internal state of the pod with the external state,
+including the pod update made by the kube-scheduler, which will come back as an event as well.
+Currently, the pod object stored in the cache is based only on events that come to the kube-scheduler.
+
+Another thing to consider is that the cache stores only the bound pods. The rest of the pods are stored in the scheduling queue,
+so once again, API calls might need to go through the scheduling queue itself.
+
+Pros:
+- Aims for a consistent internal state of the pod within the kube-scheduler before calling the API, possibly simplifying conflict resolution.
+
+Cons:
+- Requires the cache to handle and merge updates coming from both the kube-scheduler's internal actions and external API events.
+- The cache currently only stores bound pods, requiring integration with the scheduling queue for pending pods.
+- Complex logic is needed to handle external updates arriving while an internal update is pending or in progress.
+
+
+### Another things worth considering
+
+- How to handle asynchronous API errors?
+
+
+### Notes/Constraints/Caveats (Optional)
+
+
+
+### Risks and Mitigations
+
+
+
+## Design Details
+
+
+
+### Test Plan
+
+
+
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+ +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + + + +- : + +##### e2e tests + + + +- : + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: SchedulerAsyncAPICalls + - Components depending on the feature gate: kube-scheduler + +###### Does enabling the feature change any default behavior? + + + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? 
+ +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/kep.yaml b/keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/kep.yaml new file mode 100644 index 00000000000..6f3abb9c867 --- /dev/null +++ b/keps/sig-scheduling/5229-asynchronous-api-calls-during-scheduling/kep.yaml @@ -0,0 +1,28 @@ +title: Asynchronous API calls during scheduling +kep-number: 5229 +authors: + - "@macsko" +owning-sig: sig-scheduling +status: implementable +creation-date: 2025-04-08 +reviewers: + - dom4ha + - sanposhiho +approvers: + - alculquicondor + +stage: alpha + +latest-milestone: "v1.34" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.34" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: SchedulerAsyncAPICalls + components: + - kube-scheduler +disable-supported: true