- Project documentation: https://docs.google.com/document/d/1R6nR_AweptKE9sJPMdnFxIeO3jDxQfkfBhI2Ld4GCDc/edit?usp=sharing
- Brand new chatbot UI: https://k8chatbot.vercel.app/
- Presentation slides: https://docs.google.com/presentation/d/1fE-f3UlMdvPvwsPu7tjNVFxIesq9U5uNW-DJhCRIQb4/edit?usp=sharing
The Gemini logging frontend presents everything in a clean, well-designed UI. To run the backend for the Gemini remediation and advice setup:

```bash
python3 src/server.py
```

To run the frontend:

```bash
cd frontend/k8s-remediation-dashboard/src
npm run dev
```
See the screenshots below.

(Please note: the data is scraped from Prometheus every 5 minutes, so the model will train better as more data accumulates over time.)
The Vercel app is the frontend (an additional feature) and is not part of the model training or the Gemini output. We used Prometheus in Kubernetes (via minikube) to scrape the data. We tried to expose it on a public IP, but due to security constraints and the few free-tier cloud options, we decided to keep it local. If needed, you can run it against your own Prometheus instance and dataset (through src/fetch_live_metrics.py and data/k8s_live_metrics.csv). The model is stored at models/k8s_failure_model_live.pkl and has been deployed online. The Gemini output and remediation steps live in src/predictgemini.py and src/jsonextractor.py (predictgeministreamlit.py was only for testing the Streamlit integration). All of this comes together in streamlitapp.py in the root directory.
(Please note: this works with a local IP; as mentioned, we could not run Prometheus globally, so fetch_live_metrics currently works with a small amount of data. But please try it out.)
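If you reproduce the setup on minikube, exposing Prometheus to your local machine is typically done with a port-forward. A minimal sketch, assuming a common Prometheus install (the namespace and service name depend on how you deployed Prometheus and may differ):

```bash
# Forward the Prometheus service to localhost:9090.
# "monitoring" and "prometheus-server" are common defaults, not guaranteed names.
kubectl port-forward -n monitoring svc/prometheus-server 9090:9090
```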
Read on to learn more about our project.
This project aims to build a machine learning model for predicting Kubernetes cluster failures using real-time and historical cluster metrics. The goal is to identify potential issues in a Kubernetes environment, such as pod/node failures, resource exhaustion, and network issues, before they occur.
The system leverages a variety of tools and libraries, including Prometheus for metrics collection, Python for data processing, and machine learning algorithms to predict failures.
- Project Overview
- System Requirements
- Project Structure
- Setup Instructions
- Usage
- Model Evaluation
- Deployment
- Testing
- Licenses
## Project Overview

Kubernetes clusters can face a variety of issues, from pod/node failures to resource exhaustion or network problems. Predicting these failures in advance helps maintain a more stable and efficient cluster. This project includes:
- Data collection from Kubernetes clusters.
- Feature engineering to prepare metrics for machine learning (see the sketch after this list).
- Training of a machine learning model to predict failures.
- Deployment of the model in a Kubernetes environment.
- Evaluation and visualization of the model's performance.
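As an illustration of the feature-engineering step, a sketch along these lines is typical; the window sizes and column names here are assumptions for illustration, not necessarily what src/feature_engineering.py computes:

```python
# A minimal feature-engineering sketch over the scraped metrics.
# Column names (cpu_usage, memory_usage) and file paths are assumptions.
import pandas as pd

df = pd.read_csv("data/k8s_live_metrics.csv")

# Rolling statistics smooth out scrape-to-scrape noise (3 x 5-minute scrapes = 15 min)
for col in ("cpu_usage", "memory_usage"):
    df[f"{col}_mean_15m"] = df[col].rolling(window=3).mean()
    df[f"{col}_delta"] = df[col].diff()  # change since the previous scrape

df = df.dropna()  # drop rows that lack a full rolling window
df.to_csv("data/k8s_features.csv", index=False)
```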
## System Requirements

- Python 3.7+
- Prometheus (for fetching live Kubernetes metrics)
- Docker (for containerizing the application)
- Kubernetes (for deployment)
- Machine learning libraries: scikit-learn, pandas, numpy, matplotlib, joblib, etc.
## Project Structure

```
kubernetes-failure-prediction/
├── src/                          # Code for data collection, model training, and evaluation
│   ├── deployment.yaml           # Kubernetes deployment configuration
│   ├── generate_output.py        # Generates model output for analysis
│   ├── __pycache__/              # Compiled Python files
│   ├── feature_engineering.py    # Script for feature engineering
│   ├── jsonextractor.py          # Extracts JSON data for processing
│   ├── test_model.py             # Tests for evaluating model performance
│   └── external_data_link.txt    # External link to large datasets
├── docs/                         # Documentation files
│   └── README.md                 # This file
├── presentation/                 # Slides and recorded demo (YouTube/Drive link)
│   ├── slides.pptx               # Slides for the presentation
│   └── demo_link.txt             # Link to recorded demo (YouTube/Google Drive)
├── deployment/                   # Files for deploying the model to Kubernetes
│   ├── kubernetes_deploy.yaml    # Kubernetes deployment configuration
│   └── Dockerfile                # Dockerfile for containerizing the model
├── tests/                        # Unit and integration tests
├── requirements.txt              # Python dependencies
└── LICENSE                       # License information
```
## Setup Instructions

Clone the repository, create a virtual environment, and install the dependencies:

```bash
git clone https://github.com/your-username/kubernetes-failure-prediction.git
cd kubernetes-failure-prediction

python3 -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate

pip install -r requirements.txt
```
Ensure that Prometheus is running and scraping Kubernetes metrics; you can set up Prometheus as described in the Kubernetes documentation.
## Usage

You can collect Kubernetes metrics by running:

```bash
python src/fetch_live_metrics.py
```

This fetches live metrics from your Prometheus instance.
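For orientation, a minimal version of such a fetch might look like the sketch below; the Prometheus URL, PromQL queries, and CSV layout are illustrative assumptions, not necessarily what src/fetch_live_metrics.py does:

```python
# Sketch: query a local Prometheus instance and append one row of metrics to a CSV.
import csv
import time

import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed local Prometheus endpoint

QUERIES = {
    # PromQL expressions; adjust to the metrics your cluster actually exposes
    "cpu_usage": "sum(rate(container_cpu_usage_seconds_total[5m]))",
    "memory_usage": "sum(container_memory_working_set_bytes)",
    "pod_restarts": "sum(kube_pod_container_status_restarts_total)",
}

def fetch_metric(promql: str) -> float:
    """Run an instant PromQL query and return the first scalar result."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                        params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    row = {"timestamp": int(time.time())}
    row.update({name: fetch_metric(q) for name, q in QUERIES.items()})
    with open("data/k8s_live_metrics.csv", "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if f.tell() == 0:  # brand-new file: write the header first
            writer.writeheader()
        writer.writerow(row)
```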
To train the model on your dataset, use the following command:

```bash
python src/train_model_live.py
```

This trains the model on the collected data and saves it as failure_predictor.pkl in the models/ directory.
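In outline, the training step looks something like the sketch below; the feature columns, label name, and model choice are assumptions, and src/train_model_live.py may differ:

```python
# Sketch: train a classifier on the collected metrics and save it with joblib.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/k8s_live_metrics.csv")
X = df.drop(columns=["failure", "timestamp"])  # assumed column names
y = df["failure"]                              # assumed binary failure label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")

joblib.dump(model, "models/failure_predictor.pkl")
```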
Once the model is trained, you can use it to predict failures in your Kubernetes cluster:

```bash
python src/predictgemini.py
```

This script loads the trained model and predicts potential failures based on real-time metrics.
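A minimal sketch of the prediction step, assuming the same feature layout as the training sketch above; the actual script additionally passes the result to Gemini for remediation advice:

```python
# Sketch: load the trained model and score the most recent metrics sample.
import joblib
import pandas as pd

model = joblib.load("models/failure_predictor.pkl")

# Take the latest scraped row; column names are assumptions as before
latest = pd.read_csv("data/k8s_live_metrics.csv").tail(1)
features = latest.drop(columns=["failure", "timestamp"], errors="ignore")

proba = model.predict_proba(features)[0][1]  # probability of the "failure" class
print(f"Failure probability for the latest sample: {proba:.2%}")
```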
## Model Evaluation

To evaluate the model's performance, use the following script:

```bash
python src/test_model.py
```

This tests the model on a test dataset and displays evaluation metrics such as accuracy, precision, recall, and F1 score.
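As a sketch of what that evaluation involves (src/test_model.py may load and split its test data differently):

```python
# Sketch: score saved-model predictions with standard classification metrics.
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

model = joblib.load("models/failure_predictor.pkl")
test = pd.read_csv("data/k8s_live_metrics.csv")  # assumed to serve as test data
X, y = test.drop(columns=["failure", "timestamp"]), test["failure"]

preds = model.predict(X)
print(f"accuracy:  {accuracy_score(y, preds):.3f}")
print(f"precision: {precision_score(y, preds):.3f}")
print(f"recall:    {recall_score(y, preds):.3f}")
print(f"F1 score:  {f1_score(y, preds):.3f}")
```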
## Deployment

Use the Dockerfile to containerize the application:

```bash
docker build -t k8s-failure-prediction .
```
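If you need a starting point, a Dockerfile along these lines would work; the base image and entry point below are assumptions, and the actual deployment/Dockerfile may differ:

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Assumed entry point; swap in src/server.py or streamlitapp.py as needed
CMD ["python", "src/predictgemini.py"]
```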
You can deploy the model using the Kubernetes configuration in deployment.yaml:

```bash
kubectl apply -f deployment.yaml
```

This deploys your model to a Kubernetes cluster. Make sure that your cluster has access to the necessary metrics from Prometheus.
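For reference, a minimal deployment manifest might look like the sketch below; the image name, labels, and port are assumptions rather than the contents of the actual deployment.yaml:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-failure-prediction
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8s-failure-prediction
  template:
    metadata:
      labels:
        app: k8s-failure-prediction
    spec:
      containers:
        - name: predictor
          image: k8s-failure-prediction:latest
          imagePullPolicy: IfNotPresent  # use a locally built image on minikube
          ports:
            - containerPort: 8000  # assumed service port
```

On minikube, build the image against minikube's Docker daemon first (eval $(minikube docker-env)) so that IfNotPresent can find it.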
## Testing

Unit and integration tests are located in the tests/ directory. To run the tests, use:

```bash
pytest tests/
```

This runs all the unit and integration tests to ensure the code works as expected.
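As an illustration, a test in tests/ might look like this hypothetical example (the actual suite may cover much more):

```python
# tests/test_model_loading.py - hypothetical example test, not from the repo.
import joblib

def test_model_loads_and_predicts():
    model = joblib.load("models/k8s_failure_model_live.pkl")
    # The model should expose the standard scikit-learn predict interface
    assert hasattr(model, "predict")
```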
## Licenses

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Prometheus for real-time metrics collection.
- scikit-learn for machine learning algorithms.
- Kubernetes for orchestration and deployment.
Feel free to contribute to the project or suggest improvements via issues and pull requests. Happy coding!