Xeftax/cpm-rag

Project Overview

This project leverages Retrieval-Augmented Generation (RAG) to build an expert agent specialized in the CPM (Converged IP Messaging) standard. The goal is to assist in the development of a C++ library for storing conversations in line with the CPM Message Storage specification. The project started as a personal exploration of RAG technology and has evolved into a customizable pipeline that can be adapted to any dataset or RAG task.

Problem Addressed

The CPM standard is vast, spanning numerous norms and documents, such as RFCs and related specifications, each with complex interrelations, use cases, and requirements. Navigating these documents to quickly find the information needed for compliance is difficult. This project aims to build an expert agent capable of interpreting, retrieving, and verifying compliance with the standard's intricate details.

How it Works

The RAG system begins by ingesting documents, typically in PDF format, which are converted to Markdown for easier processing. The documents are then split into smaller chunks, which are enriched using AI models to extract keywords, summarize paragraphs, and link related content. This enriched data is stored in a Chroma DB for efficient retrieval using a combination of vector similarity and BM25 search techniques.
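As a rough illustration of the indexing step, the sketch below splits an already-converted Markdown file into chunks and stores them in Chroma with its default embedder. The file name and the naive paragraph-based splitter are stand-ins, not the repo's actual code; the real pipeline is orchestrated through LlamaIndex.

```python
# Minimal indexing sketch: Markdown -> chunks -> Chroma collection.
from pathlib import Path

import chromadb

def split_into_chunks(markdown: str, max_chars: int = 1500) -> list[str]:
    """Naive paragraph-based chunking, standing in for the real splitter."""
    chunks, current = [], ""
    for para in markdown.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

markdown = Path("cpm_message_storage.md").read_text()  # hypothetical file
chunks = split_into_chunks(markdown)

client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection("cpm_docs")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,  # Chroma embeds documents with its default embedder
    metadatas=[{"source": "cpm_message_storage.md"} for _ in chunks],
)
```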

The pipeline is triggered by changes in a Git repository, ensuring that the system stays up-to-date with new or modified documents. It then applies a cleaner to standardize the Markdown format, ensuring the correct hierarchical structure of headers. The system processes and embeds the data into a database, making it easily searchable for related information.
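The cleaning itself is done with AI agents (see Challenges Faced), but the invariant it targets (headers that never jump more than one level deeper than their parent) can be expressed with a small rule-based sketch:

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def normalize_headings(markdown: str) -> str:
    """Clamp heading levels so each heading is at most one level deeper
    than the previous one (e.g. '#' followed by '###' becomes '##')."""
    out, prev_level = [], 0
    for line in markdown.splitlines():
        m = HEADING.match(line)
        if m:
            level = min(len(m.group(1)), prev_level + 1)
            out.append("#" * level + " " + m.group(2))
            prev_level = level
        else:
            out.append(line)
    return "\n".join(out)
```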

Technologies Used

  • LlamaIndex: Model API call management, RAG, and data pipeline architecture.
  • PydanticAI: AI agents (enricher) for data enhancement (see the schema sketch after this list).
  • Chroma DB: Vector database for storing embeddings.
  • BM25 & Vector Search: Hybrid retrieval model for efficient document search.
  • Git: Version control to trigger data pipeline on updates.
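To make the enrichment step concrete: PydanticAI validates model output against a Pydantic schema, which might look roughly like the model below. The field names are illustrative, not the repo's actual schema.

```python
from pydantic import BaseModel, Field

class ChunkEnrichment(BaseModel):
    """Structured output requested from the enricher agent for each chunk."""

    keywords: list[str] = Field(description="Salient terms in the chunk")
    summary: str = Field(description="One-paragraph summary of the chunk")
    questions: list[str] = Field(description="Questions the chunk answers")
```

An agent pointed at such a schema rejects malformed output at validation time instead of letting it reach the index.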

Key Features

  • Customizable Pipeline: Allows easy integration of new data sources, transformers, and databases for RAG tasks.
  • Data Enrichment: Uses AI agents to enhance data with keywords, summaries, and relevant questions.
  • Search & Retrieval: Combines vector similarity and BM25 for more accurate results (one possible fusion scheme is sketched after this list).
  • Conversation History: Supports multi-message discussions to enhance context in responses.
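One common way to merge the vector and BM25 result lists is reciprocal rank fusion (RRF); the sketch below shows it as an illustrative assumption, not necessarily the method used here:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids: each id scores sum(1 / (k + rank))
    over the lists it appears in, and higher totals rank first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. merged = reciprocal_rank_fusion([vector_result_ids, bm25_result_ids])
```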

Challenges Faced

One of the main challenges was cleaning the input data, especially Markdown documents, using AI agents. Initially, the agents' structured output wasn't always reliable, which made consistent cleaning difficult. Iterative improvements and testing helped overcome this.

Future Plans

Next steps involve enhancing data enrichment, integrating Knowledge Graphs to better capture relations between documents, and creating autonomous agents that manage the database and identify new information to expand the dataset. Conversation-history handling will also be extended for more contextually rich responses.

Testing and Results

So far, the system has been tested with a set of CPM standards documents. A simple PyQt-based chatbot app has been developed to interact with the system, providing answers based on the data. Results have been encouraging, with the agent returning coherent responses for known domains. The system tracks the query-generation, source-retrieval, and answer-generation stages so each step can be inspected and tuned.

Project Strategy

This project is a data pipeline designed to process various types of data through a series of transformers. The strategy for this project includes the following key points:

  • Data Pipeline: The core of the project is a pipeline that carries each file through a sequence of processing stages.
  • Transformers: The pipeline will consist of transformers, each responsible for processing a specific type of data.
  • Raw File Handling: Whenever a raw file is uploaded, the appropriate transformer will be executed to handle it.
  • Processed Data Flow: The transformer will create a new file with the processed data, which will then trigger the next transformer in the sequence until the final stage is reached.
  • Updates and Additions: Whenever a transformer is updated or a new one is added, all impacted files will be reprocessed to ensure consistency and accuracy.

This strategy ensures a modular and scalable approach to data processing, allowing for easy updates and additions to the pipeline.
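A minimal sketch of this cascade, assuming a registry keyed on file suffix (the stage names and the converter stub are hypothetical, not the repo's actual transformers):

```python
from pathlib import Path
from typing import Callable

Transformer = Callable[[Path], Path | None]

# Registry mapping an input suffix to the transformer that consumes it.
PIPELINE: dict[str, Transformer] = {}

def stage(suffix: str):
    """Register a transformer for files ending in the given suffix."""
    def register(fn: Transformer) -> Transformer:
        PIPELINE[suffix] = fn
        return fn
    return register

def convert_pdf(path: Path) -> str:
    """Stand-in for the real PDF -> Markdown conversion."""
    return f"# {path.stem}\n\n(converted text)\n"

@stage(".pdf")
def to_markdown(path: Path) -> Path:
    out = path.with_suffix(".md")
    out.write_text(convert_pdf(path))
    return out  # the new file triggers the next stage

@stage(".md")
def clean_and_index(path: Path) -> None:
    # Final stage in this sketch; the real pipeline would clean,
    # enrich, and embed the Markdown here.
    return None

def run(path: Path) -> None:
    """Dispatch the matching transformer, then recurse on its output so
    each processed file triggers the next stage until none matches."""
    fn = PIPELINE.get(path.suffix)
    if fn is not None and (nxt := fn(path)) is not None:
        run(nxt)
```

Under this scheme, the reprocessing rule in the last bullet reduces to calling run() again on every file the updated transformer consumes.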

Git Strategy

Pre-commit

  • Discard all changes to log files unless the commit is an auto-commit.

Post-commit

  • Check for changes in the data directory:
    • D (Deleted): Rerun the transformer for the file.
    • M (Modified): Rerun the transformer for the file.
    • A (Added): Run the transformer for the file.
    • R (Renamed): Rename all derived files to match the new name.
  • Check for changes in the transformers directory:
    • D (Deleted): Delete the transformer directory with the same name.
    • M (Modified): Rerun the transformer on all files in the directory.
    • A (Added): Run the transformer on all files.
    • R (Renamed): Rename the transformer directory to the new name.
  • Commit the changes with the message "auto: reprocess the files".
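These status letters are what git diff --name-status prints, so the post-commit hook can be a thin dispatcher over its output. The sketch below assumes hypothetical run_transformer and rename_outputs entry points; the staged=True path corresponds to the development mode described next.

```python
import subprocess

def changed_files(staged: bool = False) -> list[tuple[str, str]]:
    """Return (status, path) pairs. Post-commit inspects the last commit;
    with staged=True the index is inspected instead (development mode)."""
    args = ["git", "diff", "--name-status"]
    args += ["--cached"] if staged else ["HEAD~1", "HEAD"]
    out = subprocess.run(args, capture_output=True, text=True, check=True)
    pairs = []
    for line in out.stdout.splitlines():
        status, *paths = line.split("\t")
        pairs.append((status[0], paths[-1]))  # renames: keep the new path
    return pairs

def run_transformer(path: str) -> None:
    """Hypothetical dispatch into the pipeline (see Project Strategy)."""

def rename_outputs(path: str) -> None:
    """Hypothetical: propagate a rename to all derived files."""

for status, path in changed_files():
    if path.startswith("data/"):
        if status in ("A", "M", "D"):
            run_transformer(path)
        elif status == "R":  # git reports renames as R<similarity score>
            rename_outputs(path)
```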

Development

  • When running the pipeline during development, perform the same checks against staged files instead of the last commit (the staged=True path in the sketch above).
