This project is a student project for the AMOS SS 2024 course, run jointly by the Technical University of Berlin, the Friedrich-Alexander University of Erlangen-Nuremberg, and the Free University of Berlin, with the industry partner Kubermatic. It is supervised by Prof. Riehle; the contact persons at Kubermatic are Mario Fahlandt and Sebastian Scheele.
Welcome to the Cloud Native LLM Project for the AMOS SS 2024!
This project aims to simplify the Cloud Native ecosystem by resolving information overload and fragmentation within the CNCF landscape.
Our vision is a future where developers and users can effortlessly obtain detailed, context-aware answers about CNCF projects, thereby boosting productivity and enhancing comprehension.
The project is developed in an open-source and open-model fashion.
The folder structure is as follows: [TBD]
- Select and Train an Open-Source LLM: Identify a suitable open-source LLM and train it with Kubernetes-specific data.
- Automate Data Extraction: Develop tools to automatically gather training data from publicly available Kubernetes resources such as white papers, documentation, and forums.
- Incorporate Advanced Data Techniques: Use concept and relationship extraction to enrich the training dataset, deepening the LLM's understanding of Kubernetes.
- Open Source Contribution: Release the fine-tuned model and dataset preparation tools. Potentially work in tandem with the AMOS project on knowledge graph extraction to synergize both projects’ outcomes.
- Benchmark Development: Construct a manual benchmark to serve as ground truth for quantitatively evaluating the LLM's performance.
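The automated data extraction step described above could start from something as simple as splitting project documentation into section-level training records. The sketch below is illustrative only: it assumes documentation is available as local Markdown files, and the record layout (question/answer pairs keyed on section headings) is an assumption, not a decided format.

```python
import re
from pathlib import Path

def extract_sections(markdown_text: str) -> list[dict]:
    """Split a Markdown document into heading/body training records."""
    records = []
    # Split on first- or second-level headings, keeping the heading text.
    parts = re.split(r"^(#{1,2}\s+.+)$", markdown_text, flags=re.MULTILINE)
    # After the split, parts alternates: [preamble, heading, body, heading, body, ...]
    for heading, body in zip(parts[1::2], parts[2::2]):
        body = body.strip()
        if body:
            records.append({
                "question": f"What does the section '{heading.lstrip('# ').strip()}' cover?",
                "answer": body,
            })
    return records

def extract_corpus(doc_dir: str) -> list[dict]:
    """Collect records from every Markdown file under doc_dir."""
    corpus = []
    for path in Path(doc_dir).rglob("*.md"):
        corpus.extend(extract_sections(path.read_text(encoding="utf-8")))
    return corpus
```

A real pipeline would also need per-source crawlers (docs sites, forums, white papers) and deduplication, but the section-splitting idea stays the same.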
- Data Sources: Collect documentation, white papers, blog posts, and technical documents from CNCF landscape projects.
- Preprocessing: Normalize and structure the collected data.
- Knowledge Extraction: Use Named Entity Recognition (NER) to extract key entities and create relationships between them.
- LLM Selection: Evaluate and select an appropriate open-source/open-model LLM based on performance, computational requirements, and licensing.
- Fine-tuning Procedure: Use the structured dataset for model training in a repeatable and reproducible manner, ideally using Cloud Native tools like KubeFlow and Kubernetes.
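To make the fine-tuning run repeatable on Cloud Native infrastructure as described above, one option is to package the training script in a container and run it as a Kubernetes Job (or a Kubeflow pipeline step). The manifest below is a hedged sketch: the image name, script arguments, dataset path, and resource sizes are all placeholders, not project decisions.

```yaml
# Illustrative Kubernetes Job for a reproducible fine-tuning run.
# Image, paths, model id, and resource requests are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/cnllm/trainer:latest  # placeholder image
          command: ["python", "train.py"]
          args:
            - "--dataset=/data/cncf_qa.jsonl"   # structured dataset from the extraction step
            - "--base-model=base-model-id"      # placeholder for the selected open model
            - "--seed=42"                       # fixed seed for reproducibility
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: training-data
```

Pinning the image tag, dataset version, and random seed in the manifest is what makes the run reproducible end to end.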
- Quantitative Metrics: Use specific benchmarks such as BLEU score and Factual Question Accuracy to assess model performance.
- Qualitative Evaluation: Domain experts and project maintainers will evaluate the LLM’s comprehensiveness, accuracy, and clarity.
This project aims to become a definitive knowledge base for cloud computing, enriching the knowledge of engineers in cloud-native development and supporting the maintenance and growth of open-source projects.
To get started: [TBD]