RAGO is a tool for modeling and evaluating the performance and efficiency of retrieval-augmented generation (RAG) serving. This repository is an open-source implementation for the ISCA'25 paper RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving.
RAG is a generative AI paradigm that combines generative LLMs with retrieval systems:
RAGO is a system performance optimization framework. It searches for the optimal system configuration (task placement, resource allocation, and batching policy) given the specific RAG algorithm and the underlying hardware:
RAGO takes two sets of performance files as inputs: (a) LLM inference performance results and (b) retrieval performance results. Both can be either profiled on real machines or modeled by simulators such as Generative LLM Analyzer. RAGO then evaluates the end-to-end performance Pareto frontier (e.g., time-to-first-token latency, throughput, etc.) by composing the inference and retrieval performance.
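To make the composition step concrete, below is a minimal sketch, not RAGO's actual API, of how per-component measurements might be assembled into end-to-end points and reduced to a Pareto frontier. The function name, batch sizes, and latency numbers are all illustrative assumptions.

```python
# Sketch: compose hypothetical retrieval and prefill latencies into
# end-to-end (TTFT, throughput) points and keep the Pareto frontier.
import itertools

def pareto_frontier(points):
    """Keep (ttft_ms, throughput) points that no other point dominates
    (lower TTFT and higher throughput are both better)."""
    frontier = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

# Hypothetical per-component latencies (ms) keyed by batch size.
retrieval_ms = {1: 5.0, 8: 12.0, 32: 30.0}
prefill_ms = {1: 40.0, 8: 90.0, 32: 250.0}

points = []
for rb, pb in itertools.product(retrieval_ms, prefill_ms):
    ttft = retrieval_ms[rb] + prefill_ms[pb]   # retrieval, then prefill
    tput = pb / (prefill_ms[pb] / 1000.0)      # requests per second
    points.append((round(ttft, 1), round(tput, 1)))

print(pareto_frontier(points))
```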
conda create --name rago python=3.11
conda activate rago
pip install -r requirements.txt
We provide several example RAG pipelines. Descriptions of these RAG workloads can be found in examples/README.md.
To run some examples of RAG pipelines:
cd examples
python example_case1.py
The output performance figures can be found in examples/img.
Here we use GenZ as an example simulator to produce LLM performance results for different hardware and models. A more detailed description can be found in llm_sim/README.md.
To use them:
# Initialization
git submodule update --init --recursive
cd llm_sim/genz
git checkout cb2448332a1a83eec52cd6e3b7919d56eaff380c
pip install -r requirements.txt
# Run the analysis
cd ../genz_scripts
python llm_perf.py
The results can be found in llm_sim/genz_scripts/perf_results.
Any other simulators or real-machine profiles can be plugged in, as long as they produce performance results in the same CSV format.
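As a quick sanity check when plugging in new results, a small pandas snippet can inspect the generated CSVs. The column names filtered on below (`batch_size`, `latency_ms`) are assumptions for illustration; verify them against the actual headers in llm_sim/genz_scripts/perf_results before adapting this.

```python
# Sketch: inspect the performance CSVs produced above. The filtered
# column names are assumptions; check them against the real files.
import glob
import pandas as pd

for path in glob.glob("llm_sim/genz_scripts/perf_results/*.csv"):
    df = pd.read_csv(path)
    print(path, df.columns.tolist())
    # If these (assumed) columns exist, report best latency per batch size.
    if {"batch_size", "latency_ms"} <= set(df.columns):
        print(df.groupby("batch_size")["latency_ms"].min())
```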
The retrieval performance model and its usage are described in retrieval_sim/README.md. The performance model is based on the ScaNN vector search library.
cd retrieval_sim
python retrieval_perf.py
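For intuition only, here is a toy back-of-the-envelope sketch, not the repository's model, of how scan time in a partitioned, product-quantization-based index such as ScaNN's might scale with the fraction of the index probed. Every parameter value below is an assumption.

```python
# Toy analytical sketch: the index has `nlist` partitions, of which
# `nprobe` are scanned per query; cost is bandwidth-bound over the
# scanned quantized codes. All parameter values are illustrative.
def scan_time_ms(num_vectors, nprobe, nlist, bytes_per_code=32,
                 scan_gbps=20.0):
    """Estimate the time (ms) to scan the probed candidate lists."""
    scanned_bytes = num_vectors * (nprobe / nlist) * bytes_per_code
    return scanned_bytes / (scan_gbps * 1e9) * 1e3  # bytes -> ms

# 100M vectors, probing 64 of 4096 partitions: ~2.5 ms under these numbers.
print(scan_time_ms(num_vectors=100_000_000, nprobe=64, nlist=4096))
```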
We extend our gratitude to Cliff Young, David Culler, and Eugene Le for reviewing the paper and providing insightful feedback. We also thank the extended team at Google DeepMind and System Research@Google who enabled and supported this research direction.
Code in this GitHub repository is licensed under the Apache 2.0 License.
@inproceedings{rago:isca:2025,
title={RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving},
author={Jiang, Wenqi and Subramanian, Suvinay and Graves, Cat and Alonso, Gustavo and Yazdanbakhsh, Amir and Dadu, Vidushi},
booktitle={Proceedings of the 52nd Annual International Symposium on Computer Architecture},
year={2025}
}
In addition to RAGO, there are several related works on improving RAG serving performance:
PipeRAG addresses performance optimization for RAG with iterative retrieval by algorithm- and system-level improvements.
Chameleon is a heterogeneous accelerator system for RAG serving. It prototypes FPGA-based accelerators for retrieval and runs LLM inference on GPUs.
FANNS accelerates product-quantization-based vector search.
Falcon accelerates graph-based vector search.
This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.