Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2936: LocalQueue defaulting #3652

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

yaroslava-serdiuk
Copy link
Contributor

@yaroslava-serdiuk yaroslava-serdiuk commented Nov 26, 2024

What type of PR is this?

/kind documentation

What this PR does / why we need it:

issue #2936

Does this PR introduce a user-facing change?

NONE

cc: @mimowo @mwielgus @PBundyra

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/documentation Categorizes issue or PR as related to documentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 26, 2024
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 26, 2024
Copy link

netlify bot commented Nov 26, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit e7d7ee6
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/674d82330945750008bfde43


## Design Details

Update defautling webhook for all types of job to add `kueue.x-k8s.io/queue-name:default`
Copy link
Contributor

@mimowo mimowo Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand on the proposed approach? The options I see:

  1. default the queue-name label on the Job object in a webhook
  2. default only at the workload spec.queueName level

The webhook-based approach will not work OOTB for external Jobs. Also, with the MAP effort we will want to move away from webhooks over time. On the plus side we have compatibility with tooling.

(2.) would work OOTB for external Jobs. Also, (2.) is less committal, we could always do (1.) in a follow up if needed.

Maybe the issue for (2.) is the compatibility with other tooling like kueuectl (not sure where it looks now), but we could probably adjust them to lookup the workload.

If we go with (1.) we should provide some utility functions so that it is easy for other developers to follow the semantics for the external jobs.

All in all, I'm not sure which one is better, so would be good to put list the pros and cons, and put the rejected one to alternatives describing the reasons for rejecting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The webhook-based approach will not work OOTB for external Jobs.

Could you elaborate why it won't work?

Also, with the MAP effort we will want to move away from webhooks over time.

What is MAP?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate why it won't work?

It will work, but not OOTB - the users need to implement the same semantics in their webhooks. AppWrapper is one of the frameworks, but there are also other in-house Jobs.

What is MAP?

Mutating Admission Policies. I think we could actually move the approach from webhook to MAP when it is released.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, (2.) may not work OOTB either - because if the webhook is not aware of the logic, then it would suspend the job without queue-name when manageJobsWithoutName: true (IIUC)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, (2.) may not work OOTB either - because if the webhook is not aware of the logic, then it would suspend the job without queue-name when manageJobsWithoutName: true (IIUC)

yes, this is my understanding

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's going to be important that the webhook only do this for Jobs defined in namespaces where a designated default LocalQueue already exists. If the webhook acts without namespace awareness, then we still need the namespace-level filtering mechanism from KEP-3589

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the webhook should be namespace aware, but the awareness would come from KEP-3589, so that we don't duplicate namespace filtering effort.

Copy link
Contributor Author

@yaroslava-serdiuk yaroslava-serdiuk Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I check whether the defaultqueue exists in the namespace: #3610

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but this code does not filter based on the namespaces. If we have filtering, then Kueue would stay away from namespaces which don't pass the filter, even if "default" LQ exists.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go with (1.) we should provide some utility functions so that it is easy for other developers to follow the semantics for the external jobs.

I think that this should be enough for the in-house jobs. Additionally, we might want to expand the customJob documentation (https://kueue.sigs.k8s.io/docs/tasks/dev/integrate_a_custom_job/).

Basically, I think that natively supporting custom Jobs is not necessary in here's upstream repository, and we can focus on the upstream supported Jobs like batch/v1 Job, RayJob, and KFJobs.

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting feature.

Makefile-deps.mk Outdated Show resolved Hide resolved
keps/2936-local-queue-defaulting/README.md Outdated Show resolved Hide resolved

### Risks and Mitigations

## Design Details
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using the reserved default name for the default localQueue, could we scrape the default queue-name from the cluster?

For example, we might want to prepare the dedicated .spec.default: true|false field in the LocalQueue, and then the webhook looks into the default queue-name when the webhook insert the queue-name to Jobs.

This idea is similar to storageClass.

Copy link
Contributor Author

@yaroslava-serdiuk yaroslava-serdiuk Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.spec.default: true|false field in the LocalQueue

Having default field in the API brings it owns challenge, I mentioned it earlier in the issue: #2936 (comment), i.e. validating that only one LQ is default.

Since the goal is to simplify workflow, it seems having default LQ is simple and explicit.
However I'm not sure about having a defaulting under the feature gate. In case the feature goes to GA and users are not familiar with the defaulting mechanism, they may create the LQ with default name and jobs without queue label and expecting a different behaviour.
Maybe instead of feature gate it should be a part of kueue config.

Copy link
Contributor Author

@yaroslava-serdiuk yaroslava-serdiuk Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This idea is similar to storageClass.

StorageClass allows multiple defaults and the latest default is an actual default. I also considered this as an alternative (will add it to alternatives section).

I'm thinking about use case. 1) In the case described in the issue, when admin creates only one LQ that is supposed to use, then both solutions (default LQ and spec.default=true) would work fine.
2) In case when there many LQs and one of them is default, it seems having default LQ is better, because it easier to find a default LQ among hundreds of LQs in namespace and modify (not sure if it's possible, maybe recreate) in case of misconfiguration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.
I think that the custom default LQ name is still valid request, but it seems that the customization will require an additional mechanism like comparing creationTimestamp between multiple defaults LQs.

So, I agree with the fixed default LQ name, "default" so that we can start this feature as a minimum.
After we obtain some feedbacks for the custom default LQ name, we can revisit here, and consider the customization.

@yaroslava-serdiuk yaroslava-serdiuk force-pushed the lq-default-kep branch 2 times, most recently from 0de1708 to dbbc994 Compare November 27, 2024 13:09
@yaroslava-serdiuk yaroslava-serdiuk changed the title Add kep for LQ defauling KEP-2936: LocalQueue defaulting Nov 27, 2024
Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM, would be just nice to add the short alternatives section to capture the main discussion points.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 28, 2024
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, lgtm

necessary to implement this enhancement.

## Alternatives

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add one more alternative, "default only at the workload spec.queueName level"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold
This is still not addressed, yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@mimowo
Copy link
Contributor

mimowo commented Nov 29, 2024

/approve
/hold
for the alternative section update and lgtm from @tenzen-y or @PBundyra once ready

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 29, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, yaroslava-serdiuk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 29, 2024
@PBundyra
Copy link
Contributor

/lgtm
/hold cancel
Thank you @yaroslava-serdiuk

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 29, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 29, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 53400e0bdd987d4ccf650c20b6c8b575f33f4359

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 29, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 2, 2024
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/documentation Categorizes issue or PR as related to documentation. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants