[Feature] Enable prometheus metrics for local queues #2516
This PR introduces an enhancement to enable collection of prometheus metrics for local queues.
Addresses issue: kubernetes-sigs#1833
Signed-off-by: Varsha Prasad Narsing <[email protected]>
b73eee4 to 49f4265
@astefanutti @alculquicondor @tenzen-y Could you please take a look at the proposal and provide your input? Thank you!
/assign @PBundyra
/lgtm
LGTM label has been added. Git tree hash: b3460149244f8070ae015034e8bba83771759ecc
/hold
Is this ready for a pass from approvers?
Yes
because they are global and cannot be filtered by namespaces. Furthermore, accessing cluster-scoped metrics
in secured Prometheus instances is generally restricted to cluster admin users with cluster level permissions across all namespaces and
tenants. This restriction makes it challenging for batch users to obtain the specific metrics they need for effective workload
management and to gain insights into their workloads within their limited scope of access.
Could this KEP also specify how we will prevent users with insufficient permissions from accessing metrics they shouldn't be able to see?
If this proposal is not suggesting how to prevent access from namespaces that shouldn't have the permission, then I would remove this phrase.
I agree. Maybe we could narrow the scope of this KEP, @varshaprasad96? At first glance, managing permissions seems challenging, or it would require an external mechanism.
Sorry for the late reply! I've been considering various options, and it's clear that publishing local queue metrics in each namespace could lead to complications. This approach would require multiple endpoints with their respective Service Monitors or, if using a single central Service Monitor, the correct RBAC setup to allow client access. The centralised approach could still pose some issues, such as namespace admins potentially being able to view metrics from other namespaces if they are not given the right service account (SA).
One potential solution for cluster admins is to scaffold out a Service Monitor (SM) with metrics labeling for specific namespaces from the same service endpoint. This would enable a common service endpoint for all local queues. Admins could then provide a service account with the right RBAC to a specific batch user, restricting their access to that particular Service Monitor.
However, this solution is difficult to implement in Kueue right away, since it means figuring out the right set of scaffolds, and it seems to be the responsibility of the cluster admin. For now, this could be a customisation that we document.
That being said, given the complexity, this topic probably warrants further brainstorming and a separate KEP. For now, I'll remove this reference and update the proposal to indicate that, for this iteration, metrics will be exported in the controller namespace, alongside cluster queue metrics, at the same endpoint. Does that sound reasonable?
Yes it does.
You can remove the per-namespace topic from the motivation and add a note in non-goals.
bc697cc to 15ef983
@PBundyra @alculquicondor @tenzen-y I've updated the proposal based on the reviews. Please take a look when you get a chance. Thank you!
15ef983 to 9ac3656
Thank you, @varshaprasad96! Please address my latest suggestions; besides that, LGTM. @alculquicondor @tenzen-y Please take a look.
/approve
I'll leave LGTM to @PBundyra
9d3d40d to a1320e1
/approve
Leaving LGTM to @PBundyra
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alculquicondor, tenzen-y, varshaprasad96
This commit addresses reviews by adding additional metrics for local queue.
Signed-off-by: Varsha Prasad Narsing <[email protected]>
a1320e1 to be1601a
@PBundyra I've addressed your open comments. If everything looks good, could you please approve the PR? Thanks!
Great work, thanks all!
LGTM label has been added. Git tree hash: 70b33a04bb64be859510af26def073f20d30ffdb
@PBundyra apologies, I missed the final comments and did not see you've been given the final LGTM.
@PBundyra Just wanted to check whether anything else is required to get the proposal merged. Thanks!
Sorry for the late response, I was out of office for the last couple of days. Thanks for the KEP @varshaprasad96, great work!
…#2516)
* [Feature] Enable prometheus metrics for local queues
  This PR introduces an enhancement to enable collection of prometheus metrics for local queues.
  Addresses issue: kubernetes-sigs#1833
  Signed-off-by: Varsha Prasad Narsing <[email protected]>
* Address reviews
  This commit addresses reviews by adding additional metrics for local queue.
  Signed-off-by: Varsha Prasad Narsing <[email protected]>
---------
Signed-off-by: Varsha Prasad Narsing <[email protected]>
What type of PR is this?
/kind feature
/kind documentation
What this PR does / why we need it:
This PR introduces an enhancement to enable collection of prometheus metrics for local queues.
Addresses issue: #1833
Which issue(s) this PR fixes:
Partially fixes #1833
Special notes for your reviewer:
Does this PR introduce a user-facing change?