RAGO is a tool for modeling and evaluating the performance and efficiency of retrieval-augmented generation (RAG) serving. This repository is an open-source implementation for the ISCA'25 paper RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving.
RAG is a generative AI paradigm that combines generative LLMs with retrieval systems:
RAGO is a system performance optimization framework. It searches for the optimal system configuration (task placement, resource allocation, and batching policy) given the specific RAG algorithm and the underlying hardware:
RAGO takes two sets of performance files as inputs: (a) LLM inference performance results and (b) retrieval performance results. Both can be either profiled on real machines or modeled by simulators such as Generative LLM Analyzer. RAGO then evaluates the end-to-end performance Pareto frontier (e.g., time-to-first-token latency, throughput, etc.) by composing the inference and retrieval performance.
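To make the composition step concrete, below is a minimal sketch, not RAGO's actual API, of how per-component measurements might be assembled into end-to-end points and reduced to a Pareto frontier. The function name, batch sizes, and latency numbers are all illustrative assumptions.

```python
# Sketch: compose hypothetical retrieval and prefill latencies into
# end-to-end (TTFT, throughput) points and keep the Pareto frontier.
import itertools

def pareto_frontier(points):
    """Keep (ttft_ms, throughput) points that no other point dominates
    (lower TTFT and higher throughput are both better)."""
    frontier = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

# Hypothetical per-component latencies (ms) keyed by batch size.
retrieval_ms = {1: 5.0, 8: 12.0, 32: 30.0}
prefill_ms = {1: 40.0, 8: 90.0, 32: 250.0}

points = []
for rb, pb in itertools.product(retrieval_ms, prefill_ms):
    ttft = retrieval_ms[rb] + prefill_ms[pb]   # retrieval, then prefill
    tput = pb / (prefill_ms[pb] / 1000.0)      # requests per second
    points.append((round(ttft, 1), round(tput, 1)))

print(pareto_frontier(points))
```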
conda create --name rago python=3.11
conda activate rago
pip install -r requirements.txt
We provide several example RAG pipelines. Descriptions of these RAG workloads can be found in examples/README.md.
To run some examples of RAG pipelines:
cd examples
python example_case1.py
The output performance figures can be found in examples/img.
Here we use GenZ as an example simulator to produce LLM performance results for different hardware and models. A more detailed description can be found in llm_sim/README.md.
To use them:
# Initialization
git submodule update --init --recursive
cd llm_sim/genz
git checkout cb2448332a1a83eec52cd6e3b7919d56eaff380c
pip install -r requirements.txt
# Run the analysis
cd ../genz_scripts
python llm_perf.py
The results can be found in llm_sim/genz_scripts/perf_results.
Any other simulators or real-machine profiles can be plugged in, as long as they produce performance results in the same CSV format.
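As a quick sanity check when plugging in new results, a small pandas snippet can inspect the generated CSVs. The column names filtered on below (`batch_size`, `latency_ms`) are assumptions for illustration; verify them against the actual headers in llm_sim/genz_scripts/perf_results before adapting this.

```python
# Sketch: inspect the performance CSVs produced above. The filtered
# column names are assumptions; check them against the real files.
import glob
import pandas as pd

for path in glob.glob("llm_sim/genz_scripts/perf_results/*.csv"):
    df = pd.read_csv(path)
    print(path, df.columns.tolist())
    # If these (assumed) columns exist, report best latency per batch size.
    if {"batch_size", "latency_ms"} <= set(df.columns):
        print(df.groupby("batch_size")["latency_ms"].min())
```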
The retrieval performance model and its usage are described in retrieval_sim/README.md. The performance model is based on the ScaNN vector search library.
cd retrieval_sim
python retrieval_perf.py
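For intuition only, here is a toy back-of-the-envelope sketch, not the repository's model, of how scan time in a partitioned, product-quantization-based index such as ScaNN's might scale with the fraction of the index probed. Every parameter value below is an assumption.

```python
# Toy analytical sketch: the index has `nlist` partitions, of which
# `nprobe` are scanned per query; cost is bandwidth-bound over the
# scanned quantized codes. All parameter values are illustrative.
def scan_time_ms(num_vectors, nprobe, nlist, bytes_per_code=32,
                 scan_gbps=20.0):
    """Estimate the time (ms) to scan the probed candidate lists."""
    scanned_bytes = num_vectors * (nprobe / nlist) * bytes_per_code
    return scanned_bytes / (scan_gbps * 1e9) * 1e3  # bytes -> ms

# 100M vectors, probing 64 of 4096 partitions: ~2.5 ms under these numbers.
print(scan_time_ms(num_vectors=100_000_000, nprobe=64, nlist=4096))
```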
We extend our gratitude to Cliff Young, David Culler, and Eugene Le for reviewing the paper and providing insightful feedback. We also thank the extended team at Google DeepMind and System Research@Google who enabled and supported this research direction.
Code in this GitHub repository is licensed under the Apache 2.0 License.
@inproceedings{rago:isca:2025,
title={RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving},
author={Jiang, Wenqi and Subramanian, Suvinay and Graves, Cat and Alonso, Gustavo and Yazdanbakhsh, Amir and Dadu, Vidushi},
booktitle={Proceedings of the 52nd Annual International Symposium on Computer Architecture},
year={2025}
}
In addition to RAGO, there are several related works on improving RAG serving performance:
PipeRAG addresses performance optimization for RAG with iterative retrieval by algorithm- and system-level improvements.
Chameleon is a heterogeneous accelerator system for RAG serving. It prototypes FPGA-based accelerators for retrieval and runs LLM inference on GPUs.
FANNS accelerates product-quantization-based vector search.
Falcon accelerates graph-based vector search.
This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.