Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2936: LocalQueue defaulting #3652
base: main
Are you sure you want to change the base?
KEP-2936: LocalQueue defaulting #3652
Changes from all commits
e7d7ee6
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using the reserved
default
name for the default localQueue, could we scrape the default queue-name from the cluster?For example, we might want to prepare the dedicated
.spec.default: true|false
field in the LocalQueue, and then the webhook looks into the default queue-name when the webhook insert the queue-name to Jobs.This idea is similar to storageClass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having default field in the API brings it owns challenge, I mentioned it earlier in the issue: #2936 (comment), i.e. validating that only one LQ is default.
Since the goal is to simplify workflow, it seems having
default
LQ is simple and explicit.However I'm not sure about having a defaulting under the feature gate. In case the feature goes to GA and users are not familiar with the defaulting mechanism, they may create the LQ with default name and jobs without queue label and expecting a different behaviour.
Maybe instead of feature gate it should be a part of kueue config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StorageClass allows multiple defaults and the latest default is an actual default. I also considered this as an alternative (will add it to alternatives section).
I'm thinking about use case. 1) In the case described in the issue, when admin creates only one LQ that is supposed to use, then both solutions (
default
LQ and spec.default=true) would work fine.2) In case when there many LQs and one of them is default, it seems having
default
LQ is better, because it easier to find adefault
LQ among hundreds of LQs in namespace and modify (not sure if it's possible, maybe recreate) in case of misconfiguration.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense.
I think that the custom default LQ name is still valid request, but it seems that the customization will require an additional mechanism like comparing creationTimestamp between multiple defaults LQs.
So, I agree with the fixed default LQ name, "default" so that we can start this feature as a minimum.
After we obtain some feedbacks for the custom default LQ name, we can revisit here, and consider the customization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expand on the proposed approach? The options I see:
queue-name
label on the Job object in a webhookspec.queueName
levelThe webhook-based approach will not work OOTB for external Jobs. Also, with the MAP effort we will want to move away from webhooks over time. On the plus side we have compatibility with tooling.
(2.) would work OOTB for external Jobs. Also, (2.) is less committal, we could always do (1.) in a follow up if needed.
Maybe the issue for (2.) is the compatibility with other tooling like kueuectl (not sure where it looks now), but we could probably adjust them to lookup the workload.
If we go with (1.) we should provide some utility functions so that it is easy for other developers to follow the semantics for the external jobs.
All in all, I'm not sure which one is better, so would be good to put list the pros and cons, and put the rejected one to alternatives describing the reasons for rejecting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate why it won't work?
What is MAP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will work, but not OOTB - the users need to implement the same semantics in their webhooks. AppWrapper is one of the frameworks, but there are also other in-house Jobs.
Mutating Admission Policies. I think we could actually move the approach from webhook to MAP when it is released.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, (2.) may not work OOTB either - because if the webhook is not aware of the logic, then it would suspend the job without queue-name when manageJobsWithoutName: true (IIUC)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this is my understanding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's going to be important that the webhook only do this for Jobs defined in namespaces where a designated default
LocalQueue
already exists. If the webhook acts without namespace awareness, then we still need the namespace-level filtering mechanism from KEP-3589There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the webhook should be namespace aware, but the awareness would come from KEP-3589, so that we don't duplicate namespace filtering effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I check whether the
default
queue exists in the namespace: #3610There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but this code does not filter based on the namespaces. If we have filtering, then Kueue would stay away from namespaces which don't pass the filter, even if "default" LQ exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this should be enough for the in-house jobs. Additionally, we might want to expand the customJob documentation (https://kueue.sigs.k8s.io/docs/tasks/dev/integrate_a_custom_job/).
Basically, I think that natively supporting custom Jobs is not necessary in here's upstream repository, and we can focus on the upstream supported Jobs like batch/v1 Job, RayJob, and KFJobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add one more alternative, "default only at the workload spec.queueName level"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/hold
This is still not addressed, yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated