-
Notifications
You must be signed in to change notification settings - Fork 268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2936: LocalQueue defaulting #3652
base: main
Are you sure you want to change the base?
KEP-2936: LocalQueue defaulting #3652
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
efc4dd2
to
be70ff5
Compare
|
||
## Design Details | ||
|
||
Update defautling webhook for all types of job to add `kueue.x-k8s.io/queue-name:default` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expand on the proposed approach? The options I see:
- default the
queue-name
label on the Job object in a webhook - default only at the workload
spec.queueName
level
The webhook-based approach will not work OOTB for external Jobs. Also, with the MAP effort we will want to move away from webhooks over time. On the plus side we have compatibility with tooling.
(2.) would work OOTB for external Jobs. Also, (2.) is less committal, we could always do (1.) in a follow up if needed.
Maybe the issue for (2.) is the compatibility with other tooling like kueuectl (not sure where it looks now), but we could probably adjust them to lookup the workload.
If we go with (1.) we should provide some utility functions so that it is easy for other developers to follow the semantics for the external jobs.
All in all, I'm not sure which one is better, so would be good to put list the pros and cons, and put the rejected one to alternatives describing the reasons for rejecting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The webhook-based approach will not work OOTB for external Jobs.
Could you elaborate why it won't work?
Also, with the MAP effort we will want to move away from webhooks over time.
What is MAP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate why it won't work?
It will work, but not OOTB - the users need to implement the same semantics in their webhooks. AppWrapper is one of the frameworks, but there are also other in-house Jobs.
What is MAP?
Mutating Admission Policies. I think we could actually move the approach from webhook to MAP when it is released.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, (2.) may not work OOTB either - because if the webhook is not aware of the logic, then it would suspend the job without queue-name when manageJobsWithoutName: true (IIUC)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, (2.) may not work OOTB either - because if the webhook is not aware of the logic, then it would suspend the job without queue-name when manageJobsWithoutName: true (IIUC)
yes, this is my understanding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's going to be important that the webhook only do this for Jobs defined in namespaces where a designated default LocalQueue
already exists. If the webhook acts without namespace awareness, then we still need the namespace-level filtering mechanism from KEP-3589
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the webhook should be namespace aware, but the awareness would come from KEP-3589, so that we don't duplicate namespace filtering effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I check whether the default
queue exists in the namespace: #3610
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but this code does not filter based on the namespaces. If we have filtering, then Kueue would stay away from namespaces which don't pass the filter, even if "default" LQ exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go with (1.) we should provide some utility functions so that it is easy for other developers to follow the semantics for the external jobs.
I think that this should be enough for the in-house jobs. Additionally, we might want to expand the customJob documentation (https://kueue.sigs.k8s.io/docs/tasks/dev/integrate_a_custom_job/).
Basically, I think that natively supporting custom Jobs is not necessary in here's upstream repository, and we can focus on the upstream supported Jobs like batch/v1 Job, RayJob, and KFJobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting feature.
|
||
### Risks and Mitigations | ||
|
||
## Design Details |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using the reserved default
name for the default localQueue, could we scrape the default queue-name from the cluster?
For example, we might want to prepare the dedicated .spec.default: true|false
field in the LocalQueue, and then the webhook looks into the default queue-name when the webhook insert the queue-name to Jobs.
This idea is similar to storageClass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.spec.default: true|false field in the LocalQueue
Having default field in the API brings it owns challenge, I mentioned it earlier in the issue: #2936 (comment), i.e. validating that only one LQ is default.
Since the goal is to simplify workflow, it seems having default
LQ is simple and explicit.
However I'm not sure about having a defaulting under the feature gate. In case the feature goes to GA and users are not familiar with the defaulting mechanism, they may create the LQ with default name and jobs without queue label and expecting a different behaviour.
Maybe instead of feature gate it should be a part of kueue config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This idea is similar to storageClass.
StorageClass allows multiple defaults and the latest default is an actual default. I also considered this as an alternative (will add it to alternatives section).
I'm thinking about use case. 1) In the case described in the issue, when admin creates only one LQ that is supposed to use, then both solutions (default
LQ and spec.default=true) would work fine.
2) In case when there many LQs and one of them is default, it seems having default
LQ is better, because it easier to find a default
LQ among hundreds of LQs in namespace and modify (not sure if it's possible, maybe recreate) in case of misconfiguration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense.
I think that the custom default LQ name is still valid request, but it seems that the customization will require an additional mechanism like comparing creationTimestamp between multiple defaults LQs.
So, I agree with the fixed default LQ name, "default" so that we can start this feature as a minimum.
After we obtain some feedbacks for the custom default LQ name, we can revisit here, and consider the customization.
0de1708
to
dbbc994
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically LGTM, would be just nice to add the short alternatives section to capture the main discussion points.
dbbc994
to
2e08797
Compare
2e08797
to
cc67ffd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, lgtm
necessary to implement this enhancement. | ||
|
||
## Alternatives | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add one more alternative, "default only at the workload spec.queueName level"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/hold
This is still not addressed, yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo, yaroslava-serdiuk The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
LGTM label has been added. Git tree hash: 53400e0bdd987d4ccf650c20b6c8b575f33f4359
|
cc67ffd
to
e7d7ee6
Compare
New changes are detected. LGTM label has been removed. |
What type of PR is this?
/kind documentation
What this PR does / why we need it:
issue #2936
Does this PR introduce a user-facing change?
cc: @mimowo @mwielgus @PBundyra