feat (monitoring): [alerts] enable new recommended experience for aks clusters #435

Open
wants to merge 17 commits into
base: main

Pod level alert: at least one Job instance did not complete successfully
for the last 6 hours.
Pod level alert: The average CPU usage per container exceeds 95%
for the last 5 minutes.
…erAverageMemoryHigh

Pod level alert: The average memory usage per container exceeds 95% for
the last 5 minutes
Pod level alert: One or more pods are in a failed state for the last 5
minutes
Replaced by the platform level alert: Node CPU percentage
Node level alert: A node has been unreachable for the last 15 minutes
Replaced by the platform level alert: Node memory working set percentage
is greater than 100%
…edCount

Cluster level alert: One or more containers within pods have been killed
due to out-of-memory (OOM) events for the last 5 minutes
Pod level alert: The average usage of Persistent Volumes (PVs)
on a pod exceeds 80% for the last 15 minutes
Pod level alert: The percentage of pods in a ready state falls below 80%
for any deployment or daemonset in the Kubernetes cluster for the last 5 minutes
…rRestart

Pod level alert: One or more containers within pods in the Kubernetes
cluster have been restarted at least once within the last hour
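
Several of the rules above share one shape: an average over a trailing window is compared against a percentage threshold (for example, container CPU above 95% over the last 5 minutes). As a rough illustration only, that condition can be sketched in Python as follows. The `Sample` type and `average_exceeds` function are hypothetical names made up for this sketch; this is not how the alert rules in this PR are defined, nor how Azure Monitor evaluates them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One utilization reading (hypothetical type for this sketch)."""
    timestamp: datetime
    value: float  # utilization as a percentage, 0-100


def average_exceeds(
    samples: Sequence[Sample],
    threshold: float,
    window: timedelta,
    now: datetime,
) -> bool:
    """Return True if the mean of the samples inside the trailing
    window (now - window, now] is strictly greater than threshold."""
    cutoff = now - window
    in_window = [s.value for s in samples if cutoff < s.timestamp <= now]
    if not in_window:
        # Assumption: with no data in the window, the rule does not fire.
        return False
    return sum(in_window) / len(in_window) > threshold


if __name__ == "__main__":
    now = datetime(2024, 11, 9, 2, 16)
    # Five one-minute readings at 97% CPU, plus one stale reading.
    readings = [Sample(now - timedelta(minutes=m), 97.0) for m in range(5)]
    readings.append(Sample(now - timedelta(minutes=10), 0.0))
    print(average_exceeds(readings, 95.0, timedelta(minutes=5), now))
```

The stale reading falls outside the 5-minute window, so it does not pull the average down. Treating an empty window as "not firing" is a choice made for this sketch.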
@ferantivero ferantivero marked this pull request as ready for review November 9, 2024 02:16
Contributor

@johndowns johndowns left a comment


@ferantivero thanks for this! I can see a ton of work here - thanks so much. The changes all make sense.

We should ensure we link to the Recommended alert rules for Kubernetes clusters in the reference architecture too, to make sure it's clear where these came from.

@ferantivero
Contributor Author

#sign-off please, let's consider merging this once we have landed the desired changes at the reference architecture (RA) level


2 participants