Enhance compaction metrics by segregating them into various categories #1039
+567
−38
How to categorize this PR?
/area monitoring
/kind enhancement
What this PR does / why we need it:
This PR improves monitoring of the compaction jobs to better understand the reasons for job failures, so that targeted action can be taken to minimise such failures. Please see issue #1037 for more details.
The metrics `etcddruid_compaction_jobs_total` and `etcddruid_compaction_job_duration_seconds` have been enhanced with more values for the label key `succeeded`. Earlier the values were just `true` and `false`, but now we have:

- `true` -- for success,
- `false` -- for a process failure,
- `preempted` -- for when the pod is preempted by the scheduler,
- `evicted` -- for when the pod is evicted due to one of the eviction reasons outlined in #1037,
- `deadline-exceeded` -- for when the pod is unable to finish before the `activeDeadlineSeconds` of the job,
- `unknown` -- if the job failed due to any other unknown reason.
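For illustration, here is a minimal sketch of how metrics with such a `succeeded` label can be defined using the Prometheus Go client. The variable names, the metric types, and the single-label layout are assumptions made for the sketch; the actual definitions in etcd-druid may differ (e.g. carry additional labels).

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Possible values for the `succeeded` label after this change.
const (
	labelValueTrue             = "true"              // job succeeded
	labelValueFalse            = "false"             // process failure
	labelValuePreempted        = "preempted"         // pod preempted by the scheduler
	labelValueEvicted          = "evicted"           // pod evicted
	labelValueDeadlineExceeded = "deadline-exceeded" // activeDeadlineSeconds exceeded
	labelValueUnknown          = "unknown"           // any other failure reason
)

var (
	compactionJobsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "etcddruid_compaction_jobs_total",
			Help: "Total number of compaction jobs, partitioned by outcome.",
		},
		[]string{"succeeded"},
	)

	compactionJobDurationSeconds = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "etcddruid_compaction_job_duration_seconds",
			Help: "Duration of compaction jobs in seconds, partitioned by outcome.",
		},
		[]string{"succeeded"},
	)
)

func init() {
	prometheus.MustRegister(compactionJobsTotal, compactionJobDurationSeconds)
}
```

On job completion, the reconciler would then record one of these label values, e.g. `compactionJobsTotal.WithLabelValues(labelValuePreempted).Inc()`.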
Which issue(s) this PR fixes:
Fixes #1037
Special notes for your reviewer:
To know the reason for the pod failure, we need to look at some of the pod statuses, such as the `DisruptionTarget` condition, to see whether the pod was subjected to any disruption, and `ContainerStatuses`, to see whether the pod failed because of a process failure. For this, we need access to the pod for a sufficient amount of time, so I've set `terminationGracePeriodSeconds` for the pod to `60s` to give druid enough time to fetch that information before the pod is garbage collected.
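As a rough sketch of this classification, assuming a hypothetical helper `determineFailureReason`: the reason strings checked on the `DisruptionTarget` condition (such as `PreemptionByScheduler`) follow upstream Kubernetes conventions, but the exact mapping used in this PR may differ.

```go
package compaction

import (
	corev1 "k8s.io/api/core/v1"
)

// determineFailureReason maps a failed compaction pod's status to one of the
// new `succeeded` label values. Illustrative only; not the exact code in this PR.
func determineFailureReason(pod *corev1.Pod) string {
	// The DisruptionTarget condition tells us whether the pod was disrupted and why.
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.DisruptionTarget && cond.Status == corev1.ConditionTrue {
			switch cond.Reason {
			case "PreemptionByScheduler":
				return "preempted"
			case "EvictionByEvictionAPI", "DeletionByTaintManager", "TerminationByKubelet":
				return "evicted"
			}
		}
	}
	// ContainerStatuses tells us whether the compaction process itself exited with an error.
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.State.Terminated; t != nil && t.ExitCode != 0 {
			return "false"
		}
	}
	return "unknown"
}
```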
Connected to the above, there has been a change in how the status conditions of the job are populated in Kubernetes `v1.31.0`: the `Failed` condition on the job is now only added after the termination of all the pods, instead of being added as soon as the pod goes into the `terminating` state. So when the pod is in the `terminating` state waiting out the `terminationGracePeriodSeconds`, only the `FailureTarget` condition is set initially, and the `Failed` condition is added later once the pod is terminated. Due to this difference in behaviour between recent Kubernetes versions, I chose to consider both the `FailureTarget` and `Failed` conditions as an indication of failure, so that it works smoothly for both versions.
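A minimal sketch of that check, assuming a hypothetical helper `hasJobFailed` (not the exact code in this PR); `JobFailureTarget` and `JobFailed` are the upstream condition types from `k8s.io/api/batch/v1`:

```go
package compaction

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// hasJobFailed treats both the FailureTarget and the Failed condition as failure,
// so the check behaves the same on Kubernetes versions before and after the
// behaviour change described above.
func hasJobFailed(job *batchv1.Job) bool {
	for _, cond := range job.Status.Conditions {
		if (cond.Type == batchv1.JobFailureTarget || cond.Type == batchv1.JobFailed) &&
			cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```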
The changelog for Kubernetes `v1.32.0` details the above mentioned change. Read more here.

Release note: