A Real-Time Kubernetes Health Prediction & Monitoring System
by Team ClusterBusters
KubeBoom is a real-time monitoring and predictive analytics tool for Kubernetes clusters. It leverages machine learning to detect pod health anomalies before they escalate into failures. The system integrates Prometheus for metrics collection, a live React-based dashboard for visualization, and a Generative AI backend that provides natural language explanations for any detected issues. Key features include live pod health classification, human-readable alert descriptions, and a dynamic, interactive dashboard for continuous cluster monitoring.
Check out the working demo of our project here:
- ⏱ 5:58 – Chaos Engineering with LitmusChaos
- ⏱ 18:49 – MongoDB as the Backend
- ⏱ 22:03 – KubeBoom Dashboard & LLM Integration
Note: If the demo video doesn't open, please try opening it in incognito mode.
To run and manage the system in Phase II:

Clone the repository and install Python dependencies:

    git clone https://github.com/CS-Amritha/DT.git
    cd DT
    pip install -r requirements.txt

Install frontend dependencies:

    cd frontend
    npm install
    cd ..

Start all services:

    make up

Once running, the services are available at:

- Kubeboom UI: http://localhost:8080
- Kubeboom Backend: http://localhost:8000

Stop the services, clean up resources, or test the model:

    make down
    make prune
    make test-model
Project Structure

    ├── dataset          # Contains the .csv data generated using our script
    ├── presentation     # Final presentation
    ├── models           # Models in .pkl format
    ├── src              # Source code directory
    ├── test             # Python and bash shell scripts to test the model
    ├── archive          # Past work (code, datasets, models, docs)
    ├── litmus_chaos     # LitmusChaos YAMLs for chaos creation and admin config
    ├── frontend         # The frontend components
    ├── flow_diagrams    # Flow diagrams for the live data collection and data collection processes
    └── README.md        # This file
This phase includes live monitoring of Kubernetes clusters, predictive health classification of pods, dynamic UI visualization, and natural language explanations using GenAI for improved observability and decision-making.
- Predict Kubernetes pod health (Good / Alert / Bad) using trained ML models on live Prometheus metrics
- Display health status in real time on a modern, reactive dashboard
- Integrate with a Large Language Model (LLM) to provide context-aware, human-readable explanations for unhealthy pods
We use Prometheus for scraping and querying real-time Kubernetes metrics:
- Built-in exporters like `kube-state-metrics` and `node-exporter` ensure efficient, out-of-the-box monitoring
- Prometheus provides a rich query interface (PromQL) to extract relevant pod-level data used for predictions
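As a concrete sketch of this step: per-pod values come back from Prometheus's instant-query HTTP API (`GET /api/v1/query`) as a result vector that can be flattened into a `pod -> value` map. The PromQL string below uses a standard cAdvisor series, but the exact queries KubeBoom runs are assumptions.

```python
# Per-pod CPU usage rate over the last 5 minutes (illustrative query)
CPU_QUERY = 'sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))'

def pods_from_response(resp: dict) -> dict:
    """Map pod name -> sample value from an instant-query result vector."""
    return {
        sample["metric"].get("pod", "<unknown>"): float(sample["value"][1])
        for sample in resp["data"]["result"]
    }

# Example response (shape follows the Prometheus API; the numbers are made up):
example = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "web-1"}, "value": [1718000000, "0.12"]},
            {"metric": {"pod": "web-2"}, "value": [1718000000, "0.87"]},
        ],
    },
}
print(pods_from_response(example))  # {'web-1': 0.12, 'web-2': 0.87}
```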
Trained ML models receive Prometheus metrics and classify each pod as:
| Classification | Description |
|----------------|-------------|
| Good | Stable and healthy pod |
| Alert | Under mild resource stress |
| Bad | Crashed or heavily stressed pod |
Predictions are refreshed continuously, allowing users to catch issues as they evolve.
To store prediction results and alert history, we use MongoDB:
- Faster querying of flexible alert data compared to relational databases
- Acts as a central store for both the UI dashboard and LLM context retrieval
- Stores metadata like timestamps, affected pods, metrics at the time of alert
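A sketch of the record shape described above; the field names and database/collection names are assumptions, not the backend's actual schema.

```python
from datetime import datetime, timezone

def alert_document(pod: str, status: str, metrics: dict) -> dict:
    """Build one schema-less alert record: the affected pod, its health
    class, the metrics at the time of the alert, and a timestamp."""
    return {
        "pod": pod,
        "status": status,
        "metrics": metrics,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# With pymongo, such a record would be persisted roughly as:
#   client["kubeboom"]["alerts"].insert_one(
#       alert_document("web-1", "Bad", {"cpu": 0.97, "restarts": 3}))
```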
The UI dashboard visualizes real-time pod health using a responsive, performant frontend stack:
- React + Vite + TypeScript: Fast development with type safety
- Tailwind CSS: Utility-first styling for rapid UI design
- shadcn-ui: Clean, accessible UI components built on top of Radix
- Real-time grid view of pod health status with color-coded indicators
- Auto-refreshing display of alerts and metrics
- Panel to drill into pod-specific metrics and historical anomalies
For better observability, a GenAI-based explanation engine interprets and explains alerts in plain English:
- google.generativeai (Gemini): Generates explanations based on metrics and logs
- FastAPI: Lightweight backend serving prompt routes
- MongoDB: Context store for alert data
- dotenv, Pydantic, Pathlib, Routers, and HTTPException for scalable API design
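The context-retrieval step can be sketched as a prompt builder over one MongoDB alert record; the wording and field names below are illustrative, not the repo's actual prompt. In the real backend the resulting string would be sent to Gemini via `google.generativeai` from a FastAPI route.

```python
def explanation_prompt(alert: dict) -> str:
    """Assemble the context-aware prompt for the LLM from one
    alert record (pod, health class, metrics at alert time)."""
    metric_lines = "\n".join(f"- {k}: {v}" for k, v in alert["metrics"].items())
    return (
        f"Pod {alert['pod']} was classified as {alert['status']}.\n"
        f"Metrics at the time of the alert:\n{metric_lines}\n"
        "Explain the likely cause and a suggested fix in plain English."
    )

prompt = explanation_prompt(
    {"pod": "web-1", "status": "Bad", "metrics": {"cpu": 0.97, "restarts": 3}}
)
print(prompt)
```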
- Prometheus provides rich, live metrics
- MongoDB is optimized for fast, schema-less alert queries
- Gemini LLM enhances usability with natural language explanations
- FastAPI ensures modular and scalable prompt handling
- Prometheus scrapes metrics from the Kubernetes cluster
- Backend fetches and feeds data into ML models
- Predictions are stored in MongoDB
- Dashboard fetches updated health states and renders them live
- LLM retrieves context from MongoDB and generates natural language explanations
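The steps above can be sketched as one refresh loop; the three callables are stand-ins for the Prometheus client, the trained model, and the MongoDB writer (the dashboard and the LLM both read from the store afterwards).

```python
import time

def monitoring_cycle(fetch_metrics, classify, store):
    """One pass of the pipeline: scrape -> predict -> persist."""
    metrics = fetch_metrics()                  # Prometheus scrape
    predictions = {pod: classify(vec) for pod, vec in metrics.items()}
    store(predictions)                         # MongoDB insert
    return predictions

def run_forever(interval_s, fetch_metrics, classify, store):
    """Refresh predictions continuously so issues surface as they evolve."""
    while True:
        monitoring_cycle(fetch_metrics, classify, store)
        time.sleep(interval_s)

# Dry run with stand-ins:
saved = []
print(monitoring_cycle(
    fetch_metrics=lambda: {"web-1": [0.2], "web-2": [0.95]},
    classify=lambda vec: "Bad" if vec[0] > 0.9 else "Good",
    store=saved.append,
))  # {'web-1': 'Good', 'web-2': 'Bad'}
```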
- Real-time, predictive Kubernetes monitoring
- Dynamic visualization of pod health and alerts
- Human-friendly explanations for improved DevOps response
- Scalable and modular backend and frontend architecture
Kubernetes clusters can encounter failures such as pod crashes, resource bottlenecks, and network issues. The challenge in Phase 1 is to build an AI/ML model capable of predicting these issues before they occur by analyzing historical and real-time cluster metrics.
- Node or pod failures
- Resource exhaustion (CPU, memory, disk)
- Network or connectivity issues
- Service disruptions based on logs and events
The Kubernetes cluster is logically divided into namespaces to:
- Create isolated test environments
- Ensure clean and interference-free data collection
Using LitmusChaos, we simulate realistic failure conditions to test pod resilience:
- Pod crashes
- Resource exhaustion (CPU, memory)
- Network delays and disruptions
These simulations replicate real-world stress environments, enabling our model to learn from diverse pod behaviors.
Each pod is categorized into one of the following classes based on its health status during chaos testing:
| Classification | Description |
|----------------|-------------|
| Good | Stable and healthy pod |
| Alert | Under mild resource stress |
| Bad | Crashed or heavily stressed pod |
We utilize Prometheus to collect real-time pod-level metrics during chaos tests, such as:
- CPU and memory usage
- Network I/O
- Pod restarts
- Latency and availability
Each pod instance is labeled accordingly as Good, Alert, or Bad based on these metrics.
All collected metrics, along with their labels, are exported into structured CSV files. This forms the core dataset used to:
- Train ML models
- Evaluate their accuracy and performance
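The labeling and export step can be sketched as follows; the thresholds and column names are purely illustrative (the real labels come from observed pod behavior during the chaos runs, not a fixed rule).

```python
import csv

def label(cpu: float, memory: float, restarts: int) -> str:
    """Illustrative rule only: crashes or heavy stress -> Bad,
    mild stress -> Alert, otherwise Good."""
    if restarts > 0 or cpu > 0.9 or memory > 0.9:
        return "Bad"
    if cpu > 0.7 or memory > 0.7:
        return "Alert"
    return "Good"

def export_dataset(path: str, rows: list) -> None:
    """Write labeled samples to the structured CSV used for training."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["pod", "cpu", "memory", "restarts", "label"]
        )
        writer.writeheader()
        for row in rows:
            writer.writerow(
                {**row, "label": label(row["cpu"], row["memory"], row["restarts"])}
            )
```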
🟩🟩🟩🟩🟩🟩🟩🟩🟩⬜ 90%
🟩🟩🟩🟩🟩🟩🟩🟩🟩⬜ 90%
🟩🟩🟩🟩🟩🟩🟩🟩🟩⬜ 90%