Local queues prometheus metrics #1833

Open · 1 of 3 tasks · Tracked by #3599 · May be fixed by #3673

astefanutti opened this issue Mar 13, 2024 · 23 comments

Labels: kind/feature Categorizes issue or PR as related to a new feature.

Comments

@astefanutti (Member) commented Mar 13, 2024

What would you like to be added:

Expose Prometheus metrics for local queues, equivalent to the existing cluster queue metrics, but filtered and labeled by local queue.

Similar to the visibility API, which serves information about pending workloads in local queues, it would be possible to get metrics like pending workloads, admitted active workloads, resource usage, etc., for local queues.

If cardinality is a concern, those metrics could be exposed behind a feature flag.
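For illustration, here is a minimal sketch of what one such metric could look like with prometheus/client_golang; the metric name and label set below are assumptions for this sketch, not a settled Kueue API:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// LocalQueuePendingWorkloads mirrors the existing cluster queue pending
// workloads gauge, with an extra namespace label so tenants can scope
// queries to their own namespaces. (Hypothetical name and labels.)
var LocalQueuePendingWorkloads = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Subsystem: "kueue",
		Name:      "local_queue_pending_workloads",
		Help:      "Number of pending workloads, per local queue and status.",
	}, []string{"name", "namespace", "status"},
)

func init() {
	// controller-runtime's global registry serves these on /metrics.
	metrics.Registry.MustRegister(LocalQueuePendingWorkloads)
}
```

A tenant could then query only their own series by selecting on the namespace label.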

Why is this needed:

Metrics about local queues can be useful to the batch user persona, giving them visibility into their workloads and historical trends.

While some metrics are already available for cluster queues, exposing them to batch users presents the following challenges / limitations:

  • Cluster queue metrics are global and cannot be filtered by namespaces / tenants
  • Querying "cluster-scoped" metrics in secured Prometheus instances is generally only authorized for cluster admin users that have access to all namespaces / tenants

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@astefanutti astefanutti added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 13, 2024
@astefanutti (Member Author) commented:

@alculquicondor @tenzen-y Do you think that'd be a useful / possible enhancement?

@alculquicondor (Contributor) commented Mar 13, 2024

If cardinality is a concern, those metrics could be exposed behind a feature flag.

Yes, that's the primary concern. I wouldn't make it a feature flag, but a long-term configuration field.

@tenzen-y (Member) commented:

@alculquicondor @tenzen-y Do you think that'd be a useful / possible enhancement?

I understand that this feature would be useful, but I share @alculquicondor's concern.
IIRC, we had similar discussions previously, when we designed Visibility / Visibility On-demand.
So if we introduce this feature, making it configurable via the Config API would be better.

Anyway, I guess having a small KEP would be better, since we may extend the existing Config API.

@astefanutti (Member Author) commented:

That makes sense. There could be one or two options added to the Config API, similar to the existing .metrics.enableClusterQueueResources, like .metrics.enableLocalQueues and .metrics.enableLocalQueueResources.
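For concreteness, a rough sketch of how those fields might sit in the Configuration API, extending the existing ControllerMetrics struct; the two new field names are the proposal above, not a settled API:

```go
package config

// ControllerMetrics configures metrics reporting.
type ControllerMetrics struct {
	// EnableClusterQueueResources reports resource usage and quota
	// metrics per cluster queue (existing field).
	EnableClusterQueueResources bool `json:"enableClusterQueueResources,omitempty"`

	// EnableLocalQueues would enable the local queue metrics family
	// (proposed).
	EnableLocalQueues bool `json:"enableLocalQueues,omitempty"`

	// EnableLocalQueueResources would additionally report resource
	// usage metrics per local queue (proposed).
	EnableLocalQueueResources bool `json:"enableLocalQueueResources,omitempty"`
}
```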

I can work on a small KEP if you guys give the green light.

@alculquicondor (Contributor) commented:

It seems simple enough.

@tenzen-y (Member) commented:

SGTM

@astefanutti (Member Author) commented:

Thanks for your quick feedback! I'll work on it ASAP.

/assign

@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2024
@alculquicondor (Contributor) commented:

@astefanutti are you still looking into this?

@astefanutti (Member Author) commented:

@alculquicondor I haven't, but hopefully we'll get back to it soon.

@astefanutti (Member Author) commented:

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2024
@astefanutti (Member Author) commented:

/unassign

varshaprasad96 added a commit to varshaprasad96/kueue that referenced this issue Jul 2, 2024
This PR introduces an enhancement to enable collection of
prometheus metrics for local queues.

Addresses issue: kubernetes-sigs#1833

Signed-off-by: Varsha Prasad Narsing <[email protected]>
@alculquicondor (Contributor) commented:

@varshaprasad96, please write /assign in a comment to claim this issue. It's important to communicate that you are working on an issue so that other contributors don't try to work on the same thing.

@varshaprasad96 (Member) commented:

/assign

k8s-ci-robot pushed a commit that referenced this issue Jul 29, 2024
* [Feature] Enable prometheus metrics for local queues

This PR introduces an enhancement to enable collection of
prometheus metrics for local queues.

Addresses issue: #1833

Signed-off-by: Varsha Prasad Narsing <[email protected]>

* Address reviews

This commit addresses reviews by adding additional metrics
for local queue.

Signed-off-by: Varsha Prasad Narsing <[email protected]>

---------

Signed-off-by: Varsha Prasad Narsing <[email protected]>
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 8, 2024
@tenzen-y (Member) commented Oct 8, 2024

/remove-lifecycle stale

@varshaprasad96 Are you still working on this enhancement?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 8, 2024
@varshaprasad96 (Member) commented:

@tenzen-y Yes. I'm planning to get the implementation PR up in the next few days.

@tenzen-y (Member) commented Oct 8, 2024

@tenzen-y Yes. I'm planning to get the implementation PR up in the next few days.

Awesome, thanks for your effort!

@KPostOffice (Contributor) commented:

/assign

@KPostOffice (Contributor) commented:

I've been working on this KEP for the last couple of days, and overall it seemed pretty straightforward until I got to adding the LocalQueueByStatus metric, since AFAICT the status metric for CQs bubbles up the internal cq.status. The only representation of LQ state exists inside the CQ struct, and adding all LQs to the cache struct felt wrong. My current thought is to add a new type in metrics, LocalQueueStatus, with the values active, pending, and orphaned, where the LQ inherits its status from its parent CQ. If the parent is terminating, the LQ status moves to pending; if the CQ is deleted, the status moves to orphaned. When reconciling the LQ to update its status, I can just grab the status directly from the CQ in the cache.
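For illustration, a minimal sketch of that derivation in Go; the type values follow the proposal above, while the boolean inputs are hypothetical stand-ins for whatever the cache exposes about the parent CQ:

```go
package metrics

// LocalQueueStatus is the proposed status type for local queue metrics.
type LocalQueueStatus string

const (
	LQStatusActive   LocalQueueStatus = "active"
	LQStatusPending  LocalQueueStatus = "pending"
	LQStatusOrphaned LocalQueueStatus = "orphaned"
)

// localQueueStatus derives an LQ's status from its parent CQ: orphaned if
// the CQ is gone, pending if the CQ is terminating, otherwise it inherits
// the CQ's own active/pending state.
func localQueueStatus(cqExists, cqTerminating, cqActive bool) LocalQueueStatus {
	switch {
	case !cqExists:
		return LQStatusOrphaned
	case cqTerminating:
		return LQStatusPending
	case cqActive:
		return LQStatusActive
	default:
		return LQStatusPending
	}
}
```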

@KPostOffice (Contributor) commented:

cc @mimowo @PBundyra @tenzen-y

kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this issue Nov 19, 2024
…#2516)
@varshaprasad96 (Member) commented Nov 19, 2024

@KPostOffice Could you elaborate on where exactly in localqueue_controller you are trying to update the metrics? IIUC, the local queue reconciler watches the CQ, and any status updates would be reflected here.

adding all LQs to the cache struct felt wrong

Also, the local_queue has a reference to its respective cluster queue. Can't we just query the CQ's cache status directly when reporting metrics?

@KPostOffice (Contributor) commented Nov 19, 2024

@varshaprasad96 I wasn't planning on adding the metrics to the localqueue_controller; I was going to add them to either manager.go or the cache, so that the metrics reflect Kueue's operational state. This is what is done for CQ status metrics, see here. I'm having trouble figuring out how exactly to represent the LQ's status. I think it can just update when any of the following happens (a sketch of the wiring is below):

  1. an LQ is added
  2. an LQ is updated
  3. the LQ's underlying CQ status is updated
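For illustration, a rough sketch of that update hook, building on the LocalQueueStatus type from the earlier sketch and mirroring how CQ status metrics are reported today; the gauge and function names here are hypothetical:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// localQueueByStatus exposes one series per local queue and status value,
// set to 1 for the current status and 0 otherwise. (Hypothetical metric.)
var localQueueByStatus = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Subsystem: "kueue",
		Name:      "local_queue_by_status",
		Help:      "Status of the local queue: active, pending, or orphaned.",
	}, []string{"namespace", "name", "status"},
)

// ReportLocalQueueStatus would be called from the cache on LQ add, LQ
// update, and whenever the parent CQ's status changes.
func ReportLocalQueueStatus(namespace, name string, status LocalQueueStatus) {
	for _, s := range []LocalQueueStatus{LQStatusActive, LQStatusPending, LQStatusOrphaned} {
		var v float64
		if s == status {
			v = 1
		}
		localQueueByStatus.WithLabelValues(namespace, name, string(s)).Set(v)
	}
}
```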
