Implement Prometheus metrics for LocalQueue #3673
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: KPostOffice. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Hi @KPostOffice. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Currently I've implemented this with just a boolean feature gate. I was having trouble figuring out how to pass namespace selector details down to the
@@ -146,6 +146,9 @@ type ControllerMetrics struct {
 	// metrics will be reported.
 	// +optional
 	EnableClusterQueueResources bool `json:"enableClusterQueueResources,omitempty"`
+
+	// +optional
+	EnableLocalQueueMetrics bool `json:"enableLocalQueueMetrics,omitempty"`
What is the reason to favor an API field rather than a feature gate? We don't guard other metrics by API, so I don't see such a need, but let us know if there is something specific about them. If the concern is stability of the system due to potential bugs, then a feature gate is enough; we can start from alpha. It would also allow us to simplify the code, since the feature gate status can be checked from any place, so there is no need to pass parameters.
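The simplification argued for here can be sketched in plain Go. Kueue's real implementation would use something like k8s.io/component-base/featuregate; the map-based gate and all names below are illustrative only, assuming a hypothetical `LocalQueueMetrics` gate:

```go
package main

import "fmt"

// gates is a minimal stand-in for a k8s-style feature-gate registry.
// Because it is package-level state, any call site can consult it
// directly instead of having a boolean threaded through constructors.
var gates = map[string]bool{
	"LocalQueueMetrics": false, // alpha features default to disabled
}

// Enabled reports whether the named feature gate is on.
func Enabled(name string) bool { return gates[name] }

// reportLocalQueueStatus needs no extra parameter from its callers:
// it checks the gate itself at the call site.
func reportLocalQueueStatus(namespace, name, status string) {
	if !Enabled("LocalQueueMetrics") {
		return
	}
	fmt.Printf("local_queue_status{namespace=%q,name=%q} = %s\n", namespace, name, status)
}

func main() {
	reportLocalQueueStatus("team-a", "lq1", "active") // no-op: gate is off
	gates["LocalQueueMetrics"] = true                 // e.g. via --feature-gates=LocalQueueMetrics=true
	reportLocalQueueStatus("team-a", "lq1", "active")
}
```

With an API field instead, the enablement bit would have to be plumbed from the configuration through every constructor down to each reporting site, which is exactly the parameter passing the comment wants to avoid.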
I very much agree, especially when it comes to passing parameters
There was a comment about increasing cardinality and wanting to leave this behind a long-term config field.
I see, but in that case I would like to go via the KEP process. Pity the comment does not mention why cardinality is a problem - is it for usability (this could be solved by aggregation), or performance? Do you have some other references for why cardinality might be a problem in k8s?
I assume we don't have many more LQs than namespaces, which led me to check what we do in core k8s. I see that we have metrics depending on Namespace, example. However, in this case we explicitly use CounterOpts.Namespace. Maybe we could also do it this way? PTAL.
If you want this feature in 0.10, I think the only chance is a short KEP, no API change, and guarding it by an alpha feature gate (disabled by default). Then, for the second iteration of alpha, investigate whether we need the API switch.
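For context on the `CounterOpts.Namespace` point: in the Prometheus Go client, `Namespace` is a metric-name prefix, not a Kubernetes namespace. The sketch below re-implements the joining convention of `prometheus.BuildFQName` from client_golang in stdlib-only Go, purely for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// buildFQName mirrors the behavior of prometheus.BuildFQName: the
// non-empty Namespace, Subsystem, and Name fields of CounterOpts are
// joined with underscores into the final metric name. The "Namespace"
// here is a naming prefix such as "kueue", unrelated to k8s namespaces.
func buildFQName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	fmt.Println(buildFQName("kueue", "", "pending_workloads"))       // kueue_pending_workloads
	fmt.Println(buildFQName("kueue", "local_queue", "status"))       // kueue_local_queue_status
}
```

This is why the later comments in the thread conclude that `CounterOpts.Namespace` is not an example of a per-k8s-namespace metric.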
The namespace in the example you link isn't a K8s namespace, from what I understand. It is the project namespace, used to avoid Prometheus metric-name clashes.
I see, I thought I had found such an example in core k8s, but I was wrong. Seeing no such metrics in k8s suggests that indeed it might be better not to multiply the metrics by namespace. DISCLAIMER: I haven't done an extensive search, just looked at a couple of places.
Since we have such a use case in Kueue I would be ok with the API knob, but a KEP would be useful anyway.
https://kubernetes.io/docs/reference/instrumentation/metrics/
There's a few metrics here with a namespace label.
There's an existing KEP that I was having a bit of trouble implementing, since it included the ability to use namespace/local_queue selectors for metric collection.
https://kubernetes.io/docs/reference/instrumentation/metrics/
There's a few metrics here with a namespace label.
Interesting, are these metrics opt-in or enabled by default? If k8s core enables them by default, I don't think we need to worry. Basically, I would like to better understand why cardinality is a problem.
I would suggest updating the KEP and hiding the metrics behind an alpha feature gate. This will not impact users/customers, and we don't commit to maintaining the API. Then, as the graduation point for beta, we re-evaluate both approaches.
EDIT: to be clear, I'm hesitant; maybe it is actually ok to just preemptively prevent very large outputs from the metrics endpoint. So maybe the API is fine, I will look tomorrow. Cc @tenzen-y.
'status' can have the following values:
- "active" means that the workloads are in the admission queue.
- "inadmissible" means there was a failed admission attempt for these workloads and they won't be retried until cluster conditions, which could make this workload admissible, change`,
}, []string{"local_queue", "namespace", "status"},
I think this is acceptable, but let me consider other options:
1. "local_queue", "namespace" - as in the proposal
2. "name", "namespace"
3. "local_queue" - with the key "namespace/name"

I'm not in favor of (3) because for some use cases one may want to aggregate metrics by LQ name rather than by the full key.
My only slight preference for (2) is that it is less redundant: it is already clear from the metric name that we are talking about LQs. This is not the case for the pending_workloads metrics for CQs, so I think we don't need to follow the naming pattern for labels strictly here. WDYT?
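The aggregation argument against option (3) can be made concrete. With separate labels, summing by queue name across namespaces is a plain group-by, whereas a fused "namespace/name" key would need string splitting first. A stdlib-only sketch with illustrative names, mimicking what PromQL's `sum by (name) (...)` does:

```go
package main

import "fmt"

// sample is a simplified Prometheus time series point: one label set
// plus a value. All names here are illustrative only.
type sample struct {
	labels map[string]string
	value  float64
}

// sumBy aggregates samples by a single label key, analogous to
// PromQL's `sum by (key) (...)`.
func sumBy(samples []sample, key string) map[string]float64 {
	out := map[string]float64{}
	for _, s := range samples {
		out[s.labels[key]] += s.value
	}
	return out
}

func main() {
	// Option (2): separate "name" and "namespace" labels.
	samples := []sample{
		{labels: map[string]string{"name": "lq1", "namespace": "team-a"}, value: 3},
		{labels: map[string]string{"name": "lq1", "namespace": "team-b"}, value: 2},
		{labels: map[string]string{"name": "lq2", "namespace": "team-a"}, value: 1},
	}
	// Aggregating across namespaces by queue name is a direct group-by.
	fmt.Println(sumBy(samples, "name")) // map[lq1:5 lq2:1]
}
```

With option (3)'s single `local_queue="team-a/lq1"` label, the same aggregation would require every consumer to parse the key, which is why keeping the labels separate is the more flexible choice.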
I'm happy with (2).
@@ -250,6 +250,11 @@ func (c *clusterQueue) updateQueueStatus() {
 	if status != c.Status {
 		c.Status = status
 		metrics.ReportClusterQueueStatus(c.Name, c.Status)
+		if lqMetrics {
+			for _, lq := range c.localQueues {
This iteration might be adding unnecessary performance cost. What is the scenario that requires calling it here? Maybe we could move the call per LQ, when we update the specific LQ. PTAL.
The lq status is equal to the cq status. So when the cq status updates, all the cq's associated lqs should have their statuses updated as well
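The invariant described here (each LQ's status mirrors its CQ's) is why the loop sits inside the status transition: it runs only when the CQ status actually changes, not on every call. A hypothetical stdlib-only sketch of that shape (not kueue's actual types):

```go
package main

import "fmt"

// clusterQueue is a toy stand-in for kueue's internal type: it tracks
// its own status and the keys of its associated local queues.
type clusterQueue struct {
	Status      string
	localQueues map[string]struct{}
}

// updateQueueStatus reports the new status once per LocalQueue, but
// only on an actual transition, so the per-LQ loop is not paid on
// every invocation.
func (c *clusterQueue) updateQueueStatus(status string, report func(lq, status string)) {
	if status == c.Status {
		return
	}
	c.Status = status
	for lq := range c.localQueues {
		report(lq, status) // each LQ inherits the CQ's new status
	}
}

func main() {
	c := &clusterQueue{
		Status:      "pending",
		localQueues: map[string]struct{}{"team-a/lq1": {}, "team-a/lq2": {}},
	}
	c.updateQueueStatus("active", func(lq, status string) {
		fmt.Printf("%s -> %s\n", lq, status)
	})
	// A repeated call with the same status is a no-op.
	c.updateQueueStatus("active", func(lq, status string) {
		fmt.Println("not reached")
	})
}
```

Under this reading, the cost concern above only applies when status transitions are frequent; on the steady-state path the loop never executes.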
Overall very nice to see this contribution, and I would like to include it in 0.10 if time allows. I left some comments about major things which drew my attention during an initial pass. I would also like to see some integration tests - I think for most of the metrics we should be able to extend the tests where we check metrics for CQs.
@KPostOffice in the release note, please list all the metrics and their shortened description / purpose.
/ok-to-test
/retitle Implement Prometheus metrics for LocalQueue
pkg/queue/cluster_queue.go
Outdated
Can we move the changes applied to this file to pkg/queue/local_queue.go?
Signed-off-by: Kevin <[email protected]>
Force-pushed from fa02f8b to 671f425 (compare).
@KPostOffice: The following tests failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
Implementation of the LQ metrics KEP.
This replaces PR #3609.
Which issue(s) this PR fixes:
Fixes #1833
Special notes for your reviewer:
I'm uncertain if I've updated all the metrics in the right places. I still need to write tests, but I figured I'd open the PR as I have it now in case anything is egregiously off.
Does this PR introduce a user-facing change?