Enhance compaction metrics by segregating them into various categories #1039

Open · wants to merge 3 commits into `master`
Conversation

@anveshreddy18 anveshreddy18 commented Mar 17, 2025

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

This PR improves monitoring of the compaction jobs to better understand the reasons for job failures, so that targeted action can be taken to minimise such failures. Please see issue #1037 for more details.

The metrics `etcddruid_compaction_jobs_total` and `etcddruid_compaction_job_duration_seconds` have been enhanced with additional values for the label key `succeeded`. Previously the only values were `true` and `false`; the label now takes one of the following values:

- `true` -- the compaction job succeeded.
- `false` -- the job failed because of a process failure.
- `preempted` -- the pod was preempted by the scheduler.
- `evicted` -- the pod was evicted for one of the eviction reasons outlined in #1037.
- `deadline-exceeded` -- the pod could not finish before the `activeDeadlineSeconds` of the job.
- `unknown` -- the job failed for any other reason.
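For illustration, a minimal sketch (not the actual etcd-druid code) of how such a counter could be registered and incremented, assuming a Prometheus `CounterVec`; the variable and helper names are made up for this example:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counter mirroring the metric name and label key described above.
var compactionJobsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "etcddruid_compaction_jobs_total",
		Help: "Total number of compaction jobs, labelled by outcome.",
	},
	[]string{"succeeded"},
)

// recordJobOutcome increments the counter for one of the outcomes listed above:
// "true", "false", "preempted", "evicted", "deadline-exceeded" or "unknown".
func recordJobOutcome(outcome string) {
	compactionJobsTotal.WithLabelValues(outcome).Inc()
}
```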

Which issue(s) this PR fixes:
Fixes #1037

Special notes for your reviewer:

To know the reason for a pod failure, we need to look at parts of the pod status, such as the `DisruptionTarget` condition to see whether the pod was subjected to a disruption, and the `ContainerStatuses` to see whether the pod failed because of a process failure. For this, druid needs access to the pod for a sufficient amount of time, so I've set `terminationGracePeriodSeconds` for the pod to 60s to give druid enough time to fetch this information before the pod is garbage collected.
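As an illustration of the idea, a rough sketch (an assumed helper, not necessarily the exact implementation in this PR) of deriving a failure reason from those pod status fields; the `DisruptionTarget` reason strings are the ones documented upstream in Kubernetes:

```go
package compaction

import corev1 "k8s.io/api/core/v1"

// classifyPodFailure maps the pod status to one of the `succeeded` label values
// described above (hypothetical helper for illustration).
func classifyPodFailure(pod *corev1.Pod) string {
	for _, cond := range pod.Status.Conditions {
		// DisruptionTarget is set when the pod is terminated because of a
		// disruption such as preemption or eviction.
		if cond.Type == corev1.DisruptionTarget && cond.Status == corev1.ConditionTrue {
			switch cond.Reason {
			case "PreemptionByScheduler":
				return "preempted"
			case "EvictionByEvictionAPI", "TerminationByKubelet", "DeletionByTaintManager":
				return "evicted"
			}
		}
	}
	for _, cs := range pod.Status.ContainerStatuses {
		// A terminated container with a non-zero exit code indicates a process failure.
		if cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
			return "false"
		}
	}
	return "unknown"
}
```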

Connected to the above, k8s v1.31.0 changed how the job's status conditions are populated: the `Failed` condition is now added to the job only after all of its pods have terminated, instead of as soon as a pod enters the terminating state. So while the pod is in the terminating state waiting out the `terminationGracePeriodSeconds`, only the `FailureTarget` condition is set initially; the `Failed` condition is added later, once the pod has terminated. Due to this difference in behaviour between recent k8s versions, I chose to treat both the `FailureTarget` and `Failed` conditions as an indication of failure, so that it works smoothly for both versions.
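A minimal sketch of this version-tolerant check, with an assumed helper name:

```go
package compaction

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobHasFailed treats both the interim FailureTarget condition and the final
// Failed condition as failure, so the check behaves the same on k8s versions
// before and after the change described above.
func jobHasFailed(job *batchv1.Job) bool {
	for _, cond := range job.Status.Conditions {
		if (cond.Type == batchv1.JobFailed || cond.Type == batchv1.JobFailureTarget) &&
			cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```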

Changelog entry for k8s v1.32.0 detailing the above-mentioned change:

Delay setting terminal Job conditions until all pods are terminal.
  
  Additionally, the FailureTarget condition is also added to the Job object in the first Job
  status update as soon as the failure conditions are met (backoffLimit is exceeded, maxFailedIndexes, 
  or activeDeadlineSeconds is exceeded).
  
  Similarly, the SuccessCriteriaMet condition is added in the first update as soon as the expected number
  of pod completions is reached.
  
  Also, introduce the following validation rules for Job status when JobManagedBy is enabled:
  1. the count of ready pods is less or equal than active
  2. when transitioning to terminal phase for Job, the number of terminating pods is 0
  3. terminal Job conditions (Failed and Complete) should be preceded by adding the corresponding interim conditions: FailureTarget and SuccessCriteriaMet ([#125510](https://github.com/kubernetes/kubernetes/pull/125510), [@mimowo](https://github.com/mimowo)) [SIG Apps and Testing]

Release note:

Compaction job metrics are now enhanced with the reason for job failures.

@anveshreddy18 anveshreddy18 requested a review from a team as a code owner March 17, 2025 14:02
@anveshreddy18 anveshreddy18 self-assigned this Mar 17, 2025
@anveshreddy18 anveshreddy18 added this to the v0.29.0 milestone Mar 17, 2025
@anveshreddy18
Contributor Author

/retest
