Implement Prometheus metrics for LocalQueue #3673

KPostOffice · 2024-11-27T18:22:06Z

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

Implementation of the LocalQueue (LQ) metrics KEP.

This replaces PR #3609.

Which issue(s) this PR fixes:

Fixes #1833

Special notes for your reviewer:

I'm uncertain if I've updated all the metrics in the right places. I still need to write tests, but I figured I'd open the PR as I have it now in case anything is egregiously off.

Does this PR introduce a user-facing change?

Adds configuration that lets users collect Prometheus metrics about LocalQueues, including the LQ status and the status of pending workloads.

k8s-ci-robot · 2024-11-27T18:22:12Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: KPostOffice
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2024-11-27T18:22:16Z

Hi @KPostOffice. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

netlify · 2024-11-27T18:22:22Z

✅ Deploy Preview for kubernetes-sigs-kueue ready!

Name	Link
🔨 Latest commit	`671f425`
🔍 Latest deploy log	https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/674deab687806f000857b129
😎 Deploy Preview	https://deploy-preview-3673--kubernetes-sigs-kueue.netlify.app

To edit notification comments on pull requests, go to your Netlify site configuration.

KPostOffice · 2024-11-27T19:08:28Z

Currently I've implemented this with just a boolean feature gate. I was having trouble figuring out how to pass namespace selector details down to the scheduler, cache, and queue, and then also act on those selector values. I didn't want to introduce client calls into those packages, since I figured that making network requests there would hurt performance quite severely.

mimowo · 2024-11-28T07:34:22Z

apis/config/v1beta1/configuration_types.go

@@ -146,6 +146,9 @@ type ControllerMetrics struct {
 	// metrics will be reported.
 	// +optional
 	EnableClusterQueueResources bool `json:"enableClusterQueueResources,omitempty"`
+
+	// +optional
+	EnableLocalQueueMetrics bool `json:"enableLocalQueueMetrics,omitempty"`


What is the reason to favor an API field over a feature gate? We don't guard other metrics by API, so I don't see such a need, but let us know if there is something specific about these metrics. If the concern is the stability of the system due to potential bugs, then a feature gate is enough, and we can start from alpha. It would also allow us to simplify the code, because the feature gate status can be checked from any place, so there is no need to pass parameters.

I very much agree, especially when it comes to passing parameters

There was a comment about increasing cardinality and wanting to leave this behind a long-term config field.

I see, but in that case I would like to go via the KEP process. It's a pity the comment does not mention why cardinality is a problem: is it for usability (this could be solved by aggregation) or for performance? Do you have any other references on why cardinality might be a problem in k8s?

I assume we don't have many more LQs than namespaces, which also led me to check what we do in core k8s. I see that we have metrics depending on Namespace (example). However, in this case we explicitly use CounterOpts.Namespace. Maybe we could also do it this way? PTAL.

If you want this feature in 0.10, I think the only chance is a short KEP: don't change the API, and guard it with an alpha feature gate (disabled by default). Then, for the second iteration of alpha, investigate whether we need the API switch.

The namespace in the example you link isn't a Kubernetes namespace, from what I understand. It is the project namespace, used to avoid Prometheus metric names clashing.

I see. I thought I had found such an example in k8s, but I was wrong. Seeing no such metrics in k8s suggests that it might indeed be better not to multiply the metrics by namespace. DISCLAIMER: I haven't done an extensive search, I just looked at a couple of places.

Since we have such a use case in Kueue I would be ok with the API knob, but anyway a KEP would be useful

https://kubernetes.io/docs/reference/instrumentation/metrics/

There are a few metrics here with a namespace label.

There's an existing KEP that I was having a bit of trouble implementing since it included the ability to use namespace/local_queue selectors for metric collection.

Interesting. Are these metrics opt-in or enabled by default? If k8s core enables them by default, I don't think we need to worry. Basically, I would like to better understand why cardinality is a problem.

I would suggest updating the KEP and hiding the metrics behind an alpha feature gate. This will not impact users or customers, and we don't commit to maintaining the API. Then, as a graduation point for beta, we re-evaluate both approaches.

EDIT: to be clear, I'm hesitant. Maybe it is actually OK to just preemptively prevent very large outputs from the metrics endpoint. So maybe the API is fine; I will look tomorrow. Cc @tenzen-y.
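
For illustration, here is a minimal sketch of the feature-gate approach discussed in this thread: the reporting function checks a process-wide gate itself, so callers in the scheduler, cache, and queue need no extra parameter. Everything here is an assumption made for the sketch: the gate name `LocalQueueMetrics`, the `gates` map, and the `Enabled`, `SetGate`, and `ReportLocalQueueStatus` helpers are hypothetical stand-ins, not Kueue's actual features package or the Prometheus client.

```go
package main

import (
	"fmt"
	"sync"
)

// LocalQueueMetrics is a hypothetical gate name used only in this sketch.
const LocalQueueMetrics = "LocalQueueMetrics"

// gates is a stand-in for a process-wide feature gate registry.
// An alpha gate starts out disabled.
var (
	mu    sync.RWMutex
	gates = map[string]bool{LocalQueueMetrics: false}
)

// Enabled reports whether the named gate is on. Unknown gates are off.
func Enabled(name string) bool {
	mu.RLock()
	defer mu.RUnlock()
	return gates[name]
}

// SetGate flips a gate, for example from a flag parsed at startup.
func SetGate(name string, on bool) {
	mu.Lock()
	defer mu.Unlock()
	gates[name] = on
}

// reported records what would have been sent to the metrics backend.
var reported []string

// ReportLocalQueueStatus is a no-op unless the gate is enabled, so callers
// do not need a "metrics enabled" parameter threaded through them.
func ReportLocalQueueStatus(namespace, name, status string) {
	if !Enabled(LocalQueueMetrics) {
		return
	}
	reported = append(reported, namespace+"/"+name+"="+status)
}

func main() {
	ReportLocalQueueStatus("team-a", "default", "active") // dropped: gate is off
	SetGate(LocalQueueMetrics, true)
	ReportLocalQueueStatus("team-a", "default", "active") // recorded
	fmt.Println(reported)
}
```

The trade-off raised above still applies: a gate like this is all-or-nothing, whereas an API field could later carry namespace or LocalQueue selectors to limit cardinality.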

mimowo · 2024-11-28T07:43:19Z

pkg/metrics/metrics.go

+'status' can have the following values:
+- "active" means that the workloads are in the admission queue.
+- "inadmissible" means there was a failed admission attempt for these workloads and they won't be retried until cluster conditions, which could make this workload admissible, change`,
+		}, []string{"local_queue", "namespace", "status"},


I think this is acceptable, but let me consider other options:

1. "local_queue", "namespace" - as in the proposal
2. "name", "namespace"
3. "local_queue" - with the key "namespace/name"

I'm not in favor of (3), because for some use cases one may want to aggregate metrics by LQ name rather than by the full key.

I have a slight preference for (2), only because it is less redundant: it is already clear from the metric name that we are talking about LQs. This is not the case for the pending_workloads metric for CQs, so I think we don't need to follow the naming pattern for parameters strictly here. WDYT?

I'm happy with 2
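
As a rough sketch of why separate labels (option 2) keep aggregation by LQ name easy, the toy example below stores values under a struct of labels and sums across namespaces. The `labels` and `gauge` types and the `SumByName` helper are hypothetical and exist only for this illustration; they are not the Prometheus client API and not the code in this PR.

```go
package main

import "fmt"

// labels mirrors option 2 from the discussion: separate "name" and
// "namespace" labels, plus "status".
type labels struct {
	Name, Namespace, Status string
}

// gauge is a toy stand-in for a labeled gauge; it is not the Prometheus client.
type gauge map[labels]int

// SumByName aggregates the values of every series with the given LQ name,
// across all namespaces and statuses.
func (g gauge) SumByName(name string) int {
	total := 0
	for l, v := range g {
		if l.Name == name {
			total += v
		}
	}
	return total
}

func main() {
	pending := gauge{
		{Name: "default", Namespace: "team-a", Status: "active"}:       3,
		{Name: "default", Namespace: "team-b", Status: "inadmissible"}: 2,
		{Name: "batch", Namespace: "team-a", Status: "active"}:         5,
	}
	// Sums the two "default" queues that live in different namespaces.
	fmt.Println(pending.SumByName("default"))
}
```

With a single combined "namespace/name" label (option 3), the same aggregation would first require splitting the key apart again.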

mimowo · 2024-11-28T07:48:47Z

pkg/cache/clusterqueue.go

@@ -250,6 +250,11 @@ func (c *clusterQueue) updateQueueStatus() {
 	if status != c.Status {
 		c.Status = status
 		metrics.ReportClusterQueueStatus(c.Name, c.Status)
+		if lqMetrics {
+			for _, lq := range c.localQueues {


This iteration might add unnecessary performance cost. In which scenario does it need to be called here? Maybe we could move the call to where we update the specific LQ. PTAL.

The LQ status is equal to the CQ status, so when the CQ status changes, all of the CQ's associated LQs should have their statuses updated as well.
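
To make the cost being discussed concrete, here is a simplified sketch of the pattern: LQ statuses are re-reported only when the CQ status actually changes, so the loop over LQs runs once per transition rather than on every update call. The `clusterQueue` type below is a pared-down stand-in for the one in pkg/cache, and `report` and `reports` are hypothetical helpers that replace the real metrics calls.

```go
package main

import "fmt"

// clusterQueue is a pared-down stand-in for the cache's clusterQueue type.
type clusterQueue struct {
	Name        string
	Status      string
	localQueues map[string]struct{} // keys are "namespace/name"
}

// reports records every status that would have been exported as a metric.
var reports []string

// report is a hypothetical replacement for the real metrics functions.
func report(kind, key, status string) {
	reports = append(reports, kind+" "+key+" "+status)
}

// updateStatus reports the CQ status and mirrors it onto every associated
// LQ, but only when the status differs from the previous one.
func (c *clusterQueue) updateStatus(status string) {
	if status == c.Status {
		return // no transition, so no per-LQ iteration
	}
	c.Status = status
	report("cq", c.Name, status)
	for key := range c.localQueues {
		report("lq", key, status)
	}
}

func main() {
	cq := &clusterQueue{
		Name:   "cluster-queue",
		Status: "pending",
		localQueues: map[string]struct{}{
			"team-a/default": {},
			"team-b/default": {},
		},
	}
	cq.updateStatus("active") // one CQ report plus one per LQ
	cq.updateStatus("active") // unchanged status: nothing reported
	fmt.Println(len(reports))
}
```

The work per transition is proportional to the number of LQs that point at the CQ, which is the cost the review comment asks about.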

mimowo

Overall, it is very nice to see this contribution, and I would like to include it in 0.10 if time allows. I left some comments about the major things that drew my attention during the initial pass. I would also like to see some integration tests. I think that for most of the metrics we should be able to extend the tests where we check the metrics for CQs.

cc @tenzen-y @dgrove-oss @PBundyra

mimowo · 2024-11-28T08:14:24Z

@KPostOffice in the release note, please list all the metrics and their shortened description / purpose.

mbobrovskyi · 2024-11-28T10:09:16Z

/ok-to-test

tenzen-y · 2024-11-28T16:46:33Z

/retitle Implement Prometheus metrics for LocalQueue

PBundyra · 2024-11-29T10:20:35Z

pkg/queue/cluster_queue.go

Can we move changes applied to this file to pkg/queue/local_queue.go?

k8s-ci-robot · 2024-12-02T17:25:26Z

@KPostOffice: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kueue-verify-main	`671f425`	link	true	`/test pull-kueue-verify-main`
pull-kueue-test-multikueue-e2e-main	`671f425`	link	true	`/test pull-kueue-test-multikueue-e2e-main`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot requested review from denkensk and PBundyra November 27, 2024 18:22

k8s-ci-robot added the cncf-cla: yes and needs-ok-to-test labels Nov 27, 2024

k8s-ci-robot added the size/L and size/XL labels and removed the size/L label Nov 27, 2024

mimowo reviewed Nov 28, 2024

k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Nov 28, 2024

tenzen-y mentioned this pull request Nov 28, 2024

Implement/local metrics #3609

Closed

k8s-ci-robot changed the title ~~Lq metrics~~ Implement Prometheus metrics for LocalQueue Nov 28, 2024

PBundyra reviewed Nov 29, 2024

k8s-ci-robot added the size/L label and removed the size/XL label Dec 2, 2024

KPostOffice added 2 commits December 2, 2024 12:13

add LocalQueue metrics (no feature gate)

5a78a30

Signed-off-by: Kevin <[email protected]>

add all clear and report calls

3901168

Signed-off-by: Kevin <[email protected]>

KPostOffice added 3 commits December 2, 2024 12:13

add feature gate

a2c952d

Signed-off-by: Kevin <[email protected]>

cleanup todos and add more feature gates

9493af3

Signed-off-by: Kevin <[email protected]>

use feature gate instead of config

671f425

Signed-off-by: Kevin <[email protected]>

KPostOffice force-pushed the lq-metrics branch from fa02f8b to 671f425 Compare December 2, 2024 17:13
