Enhance compaction metrics by segregating them into various categories #1039

Open · wants to merge 3 commits into `master`
Conversation

@anveshreddy18 anveshreddy18 commented Mar 17, 2025

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

This PR improves monitoring of the compaction jobs to better understand the reasons for job failures, so that targeted action can be taken to minimise such failures. Please see issue #1037 for more details.

The metrics `etcddruid_compaction_jobs_total` and `etcddruid_compaction_job_duration_seconds` have been enhanced with additional values for the label key `succeeded`. Previously the only values were `true` and `false`; the label now takes one of the following values:

- `true` -- the compaction job succeeded.
- `false` -- the job failed because of a process failure.
- `preempted` -- the pod was preempted by the scheduler.
- `evicted` -- the pod was evicted for one of the eviction reasons outlined in #1037.
- `deadline-exceeded` -- the pod could not finish before the `activeDeadlineSeconds` of the job.
- `unknown` -- the job failed for any other reason.
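For illustration, a minimal sketch (not the actual etcd-druid code) of how such a counter could be registered and incremented, assuming a Prometheus `CounterVec`; the variable and helper names are made up for this example:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counter mirroring the metric name and label key described above.
var compactionJobsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "etcddruid_compaction_jobs_total",
		Help: "Total number of compaction jobs, labelled by outcome.",
	},
	[]string{"succeeded"},
)

// recordJobOutcome increments the counter for one of the outcomes listed above:
// "true", "false", "preempted", "evicted", "deadline-exceeded" or "unknown".
func recordJobOutcome(outcome string) {
	compactionJobsTotal.WithLabelValues(outcome).Inc()
}
```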

Which issue(s) this PR fixes:
Fixes #1037

Special notes for your reviewer:

To know the reason for a pod failure, we need to look at parts of the pod status, such as the `DisruptionTarget` condition to see whether the pod was subjected to a disruption, and the `ContainerStatuses` to see whether the pod failed because of a process failure. For this, druid needs access to the pod for a sufficient amount of time, so I've set `terminationGracePeriodSeconds` for the pod to 60s to give druid enough time to fetch this information before the pod is garbage collected.
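As an illustration of the idea, a rough sketch (an assumed helper, not necessarily the exact implementation in this PR) of deriving a failure reason from those pod status fields; the `DisruptionTarget` reason strings are the ones documented upstream in Kubernetes:

```go
package compaction

import corev1 "k8s.io/api/core/v1"

// classifyPodFailure maps the pod status to one of the `succeeded` label values
// described above (hypothetical helper for illustration).
func classifyPodFailure(pod *corev1.Pod) string {
	for _, cond := range pod.Status.Conditions {
		// DisruptionTarget is set when the pod is terminated because of a
		// disruption such as preemption or eviction.
		if cond.Type == corev1.DisruptionTarget && cond.Status == corev1.ConditionTrue {
			switch cond.Reason {
			case "PreemptionByScheduler":
				return "preempted"
			case "EvictionByEvictionAPI", "TerminationByKubelet", "DeletionByTaintManager":
				return "evicted"
			}
		}
	}
	for _, cs := range pod.Status.ContainerStatuses {
		// A terminated container with a non-zero exit code indicates a process failure.
		if cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
			return "false"
		}
	}
	return "unknown"
}
```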

Connected to the above, k8s v1.31.0 changed how the job's status conditions are populated: the `Failed` condition is now added to the job only after all of its pods have terminated, instead of as soon as a pod enters the terminating state. So while the pod is in the terminating state waiting out the `terminationGracePeriodSeconds`, only the `FailureTarget` condition is set initially; the `Failed` condition is added later, once the pod has terminated. Due to this difference in behaviour between recent k8s versions, I chose to treat both the `FailureTarget` and `Failed` conditions as an indication of failure, so that it works smoothly for both versions.
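A minimal sketch of this version-tolerant check, with an assumed helper name:

```go
package compaction

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobHasFailed treats both the interim FailureTarget condition and the final
// Failed condition as failure, so the check behaves the same on k8s versions
// before and after the change described above.
func jobHasFailed(job *batchv1.Job) bool {
	for _, cond := range job.Status.Conditions {
		if (cond.Type == batchv1.JobFailed || cond.Type == batchv1.JobFailureTarget) &&
			cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```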

Changelog entry for k8s v1.32.0 detailing the above-mentioned change:

Delay setting terminal Job conditions until all pods are terminal.
  
  Additionally, the FailureTarget condition is also added to the Job object in the first Job
  status update as soon as the failure conditions are met (backoffLimit is exceeded, maxFailedIndexes, 
  or activeDeadlineSeconds is exceeded).
  
  Similarly, the SuccessCriteriaMet condition is added in the first update as soon as the expected number
  of pod completions is reached.
  
  Also, introduce the following validation rules for Job status when JobManagedBy is enabled:
  1. the count of ready pods is less or equal than active
  2. when transitioning to terminal phase for Job, the number of terminating pods is 0
  3. terminal Job conditions (Failed and Complete) should be preceded by adding the corresponding interim conditions: FailureTarget and SuccessCriteriaMet ([#125510](https://github.com/kubernetes/kubernetes/pull/125510), [@mimowo](https://github.com/mimowo)) [SIG Apps and Testing]

Release note:

Compaction job metrics are now enhanced with the reason for job failures.

@anveshreddy18 anveshreddy18 requested a review from a team as a code owner March 17, 2025 14:02
@anveshreddy18 anveshreddy18 self-assigned this Mar 17, 2025
@anveshreddy18 anveshreddy18 added this to the v0.29.0 milestone Mar 17, 2025
@anveshreddy18
Contributor Author

/retest
