Commit

feat(ruleLabels): consider common labels
richardtief committed Oct 9, 2024
1 parent a23b9c4 commit dba7084
Showing 7 changed files with 45 additions and 93 deletions.
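
The templates in this commit replace the inline `additionalRuleLabels` block with an `include` of a named helper whose definition is not part of the diff shown here. A minimal sketch of what such a helper could look like, assuming it merges chart-wide common labels with the rule-specific ones. The `.Values.global.commonLabels` key and the precedence order are assumptions for illustration, not taken from this commit:

```yaml
{{/*
Hypothetical sketch, not the helper from this commit.
Assumed input: .Values.global.commonLabels (illustrative key).
*/}}
{{- define "kubernetes-operations.additionalRuleLabels" -}}
{{- /* Common labels are optional; fall back to an empty dict. */ -}}
{{- $common := dict -}}
{{- if .Values.global -}}
{{- $common = default (dict) .Values.global.commonLabels -}}
{{- end -}}
{{- /* Rule-specific labels, as read by the block this commit removes. */ -}}
{{- $extra := default (dict) .Values.prometheusRules.additionalRuleLabels -}}
{{- /* merge gives precedence to earlier arguments, so $extra overrides $common. */ -}}
{{- $merged := merge (dict) $extra $common -}}
{{- if $merged -}}
{{- toYaml $merged -}}
{{- end -}}
{{- end -}}
```

Because the call sites pipe the result through `nindent 6`, an empty result renders only whitespace, so under this sketch the surrounding `if` in each rule is no longer needed.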
2 changes: 1 addition & 1 deletion charts/kubernetes-operations/Chart.yaml
@@ -3,7 +3,7 @@

apiVersion: v2
name: kubernetes-operations
-version: 0.0.8
+version: 0.0.9
description: A set of Plutono dashboards and Prometheus alerting rules combined with playbooks to ensure effective operations of Kubernetes.
maintainers:
- name: Richard Tief (I520251)
Original file line number Diff line number Diff line change
@@ -8,9 +8,7 @@ groups:
labels:
severity: {{ dig "KubernetesApiServerDown" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubernetesApiServerDown.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: Kubernetes API server has disappeared from Prometheus target discovery.
summary: Target disappeared from Prometheus target discovery.
@@ -31,9 +29,7 @@ groups:
labels:
severity: {{ dig "KubernetesApiServerLatency" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubernetesApiServerLatency.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: ApiServerLatency for `{{`{{ $labels.resource }}`}}` is higher than usual for the past {{ dig "KubernetesApiServerDown" "for" "30m" .Values.prometheusRules }} minutes. Inspect apiserver logs for the root cause.
summary: ApiServerLatency is unusually high.
@@ -55,9 +51,7 @@ groups:
labels:
severity: {{ dig "KubeAggregatedAPIDown" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubeAggregatedAPIDown.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: Kubernetes aggregated API `{{`{{ $labels.namespace }}`}}/{{`{{ $labels.name }}`}}` has been only `{{`{{ $value | humanizePercentage }}`}}` available over the last {{ dig "KubeAggregatedAPIDown" "for" "5m" .Values.prometheusRules }} . Run `kubectl get apiservice | grep -v Local` and confirm the services of aggregated APIs have active endpoints.
summary: Kubernetes aggregated API is down.
32 changes: 8 additions & 24 deletions charts/kubernetes-operations/alerts/kubernetes-health.alerts.tpl
@@ -8,9 +8,7 @@ groups:
labels:
severity: {{ dig "KubeStateMetricsScrapeFailed" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubeStateMetricsScrapeFailed.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: Failed to scrape kube-state-metrics. Metrics on the cluster state might be outdated.
summary: kube-state-metrics scrape failed.
@@ -23,9 +21,7 @@ groups:
labels:
severity: {{ dig "KubernetesNodeManyNotReady" "severity" "critical" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubernetesManyNodesNotReady.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: "`{{`{{ $value }}`}}` nodes are `NotReady` for more than an hour."
summary: Many Nodes are NotReady.
@@ -38,9 +34,7 @@ groups:
labels:
severity: {{ dig "KubernetesNodeNotReady" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubernetesNodeNotReady.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
summary: Node status is NotReady.
description: Node `{{`{{ $labels.node }}`}}` is NotReady for more than an hour.
@@ -53,9 +47,7 @@ groups:
labels:
severity: {{ dig "KubernetesNodeReadinessFlapping" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubernetesNodeReadinessFlapping.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: Node `{{`{{ $labels.node }}`}}` is flapping between Ready and NotReady.
summary: Node readiness status is flapping.
@@ -68,9 +60,7 @@ groups:
labels:
severity: {{ dig "KubernetesPodRestartingTooMuch" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubernetesPodRestartingTooMuch.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: Container `{{`{{ $labels.container }}`}}` of pod `{{`{{ $labels.namespace }}/{{ $labels.pod }}`}}` is restarting constantly.
summary: Pod is in a restart loop.
@@ -84,9 +74,7 @@ groups:
labels:
severity: {{ dig "KubernetesTooManyOpenFiles" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubernetesTooManyOpenFiles.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: "`{{`{{ $labels.job }}`}}` on `{{`{{ $labels.node }}`}}` is using `{{`{{ $value }}%`}}` of the available `file/socket` descriptors."
summary: Too many open file descriptors.
@@ -108,9 +96,7 @@ groups:
labels:
severity: {{ dig "KubernetesDeploymentReplicasMismatch" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubernetesDeploymentReplicasMismatch.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: Deployment `{{`{{ $labels.namespace }}/{{ $labels.deployment }}`}}` has not matched the expected number of replicas for longer than 10 minutes.
summary: Deployment has not matched the expected number of replicas.
@@ -135,9 +121,7 @@ groups:
labels:
severity: {{ dig "KubePodNotReady" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubePodNotReady.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: Pod `{{`{{ $labels.namespace }}/{{ $labels.pod }}`}}` has been in a non-ready state for longer than {{ dig "KubePodNotReady" "for" "30m" .Values.prometheusRules }} minutes.
summary: Pod has been in a non-ready state for more than {{ dig "KubePodNotReady" "for" "30m" .Values.prometheusRules }} minutes.
28 changes: 7 additions & 21 deletions charts/kubernetes-operations/alerts/kubernetes-kubelet.alerts.tpl
@@ -13,9 +13,7 @@ groups:
labels:
severity: {{ dig "KubernetesManyKubeletsDown" "severity" "critical" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubernetesManyKubeletsDown.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: Many Kubelets are DOWN.
summary: More than 4 Kubelets are DOWN.
@@ -28,9 +26,7 @@ groups:
labels:
severity: {{ dig "KubeletDown" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubeletDown.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: Kubelet on `{{`{{ $labels.node }}`}}` is DOWN.
summary: A Kubelet is DOWN.
@@ -52,9 +48,7 @@ groups:
labels:
severity: {{ dig "KubeletTooManyPods" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubeletTooManyPods.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: Kubelet `{{`{{ $labels.node }}`}}` is running at `{{`{{ $value | humanizePercentage }}`}}` of its Pod capacity.
summary: Kubelet is running at capacity.
@@ -76,9 +70,7 @@ groups:
labels:
severity: {{ dig "KubeletFull" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubeletFull.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: Kubelet is full, no more pods can be scheduled on `{{`{{ $labels.node }}`}}`.
summary: Kubelet is full.
@@ -91,9 +83,7 @@ groups:
labels:
severity: {{ dig "KubeletHighNumberOfGoRoutines" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubeletHighNumberOfGoRoutines.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: Kubelet on `{{`{{ $labels.node }}`}}` might be unresponsive due to a high number of Go routines.
summary: High number of Go routines.
@@ -106,9 +96,7 @@ groups:
labels:
severity: {{ dig "KubeletHighNumberOfGoRoutinesPredicted" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubeletHighNumberOfGoRoutinesPredicted.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: Kubelet on `{{`{{$labels.node}}`}}` might become unresponsive due to a high number of Go routines within 2 hours.
summary: Predicting high number of Go routines.
@@ -130,9 +118,7 @@ groups:
labels:
severity: {{ dig "KubeletManyRequestErrors" "severity" "warning" .Values.prometheusRules }}
runbook_url: https://github.com/cloudoperators/kubernetes-operations/playbooks/KubeletManyRequestErrors.md
-{{- if .Values.prometheusRules.additionalRuleLabels }}
-{{- toYaml .Values.prometheusRules.additionalRuleLabels | nindent 6 }}
-{{- end }}
+{{ include "kubernetes-operations.additionalRuleLabels" . | nindent 6 }}
annotations:
description: "`{{`{{ $value | humanizePercentage }}`}}` of requests from kubelet on `{{`{{ $labels.node }}`}}` are erroneous."
summary: Many HTTP 5xx responses for Kubelet requests.