
MRAGSurvey

A collection of papers and resources related to Multimodal Retrieval-Augmented Generation (MRAG).

The papers are organized following our survey "A Survey on Multimodal Retrieval-Augmented Generation".

If you find a mistake or have any suggestions, please let us know by e-mail: [email protected]

(We suggest cc'ing [email protected] as well, in case of delivery issues.)

If you find our survey useful for your research, please cite the following paper:

@misc{MRAGSurvey,
    title={A Survey on Multimodal Retrieval-Augmented Generation},
    author={Lang Mei and Siyu Mo and Zhihan Yang and Chong Chen},
    year={2025},
    howpublished={GitHub},
    url={https://github.com/PanguIR/MRAGSurvey},
}

Overview of MRAG

MRAG1.0

The architecture of MRAG1.0, often termed "pseudo-MRAG", closely resembles traditional RAG and consists of three modules: Document Parsing and Indexing, Retrieval, and Generation. While the overall process mirrors traditional RAG, the key distinction lies in the Document Parsing stage: specialized models convert data of diverse modalities into modality-specific captions, which are then stored alongside textual data for use in the subsequent stages.

[Figure: The MRAG1.0 architecture]
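To make the pipeline concrete, below is a minimal, self-contained sketch of the three MRAG1.0 modules. The components are deliberately naive stand-ins: caption_image is a hypothetical stub for a specialized captioning model, and retrieval uses simple keyword overlap in place of BM25 or a dense retriever.

from dataclasses import dataclass

@dataclass
class IndexedDoc:
    doc_id: str
    text: str  # original text, or a modality-specific caption of non-text data

def caption_image(image_path: str) -> str:
    # Hypothetical stand-in for a specialized captioning model.
    return f"caption describing the content of {image_path}"

def parse_and_index(text_docs: dict, image_docs: dict) -> list:
    # Document Parsing and Indexing: non-text modalities become captions,
    # stored alongside textual data in a single text index.
    index = [IndexedDoc(doc_id, text) for doc_id, text in text_docs.items()]
    index += [IndexedDoc(doc_id, caption_image(path)) for doc_id, path in image_docs.items()]
    return index

def retrieve(query: str, index: list, k: int = 3) -> list:
    # Retrieval: naive keyword overlap stands in for a real retriever.
    q_terms = set(query.lower().split())
    ranked = sorted(index, key=lambda d: -len(q_terms & set(d.text.lower().split())))
    return ranked[:k]

def build_prompt(query: str, contexts: list) -> str:
    # Generation: retrieved captions and text feed an ordinary LLM prompt.
    context = "\n".join(d.text for d in contexts)
    return f"Answer using the context below.\nContext:\n{context}\nQuestion: {query}"

# Usage: an image enters the index only through its caption.
index = parse_and_index({"d1": "RAG retrieves documents before generation."},
                        {"i1": "figures/pipeline.png"})
print(build_prompt("what does RAG do?", retrieve("what does RAG do?", index)))

Since everything is reduced to text at parsing time, any visual detail the captioning model fails to describe is lost to all later stages, which is the main limitation this architecture inherits.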

MRAG2.0

The MRAG2.0 architecture retains multimodal data in its original form during document parsing and indexing, and introduces multimodal retrieval together with multimodal large language models (MLLMs) for answer generation, truly entering the multimodal era.

[Figure: The MRAG2.0 architecture]
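One common way to realize multimodal retrieval is a shared text-image embedding space such as CLIP, so that raw images are indexed directly instead of being replaced by captions. The sketch below assumes the sentence-transformers library with its clip-ViT-B-32 checkpoint; the image paths are illustrative.

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # maps text and images into one space

# Indexing: embed the raw images themselves, preserving visual detail.
image_paths = ["figures/architecture.png", "figures/results_chart.png"]
corpus_embeddings = model.encode([Image.open(p) for p in image_paths])

# Retrieval: a text query is matched against image embeddings by cosine similarity.
query_embedding = model.encode("overall architecture of the retrieval pipeline")
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]

for hit in hits:
    print(image_paths[hit["corpus_id"]], round(hit["score"], 3))

# The retrieved images would then be passed, together with the query,
# to an MLLM for answer generation.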

MRAG3.0

The MRAG3.0 architecture integrates document screenshots during the document parsing and indexing stages to minimize information loss. At the input stage, a Multimodal Search Planning module unifies Visual Question Answering (VQA) and Retrieval-Augmented Generation (RAG) tasks while refining user query precision. At the output stage, a Multimodal Retrieval-Augmented Composition module enhances answer generation by transforming plain-text answers into multimodal formats, thereby enriching information delivery.

[Figure: The MRAG3.0 architecture]
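As a rough illustration of how the planning module might unify VQA and RAG, the hypothetical sketch below routes a query either to direct answering over an attached image or to retrieval followed by multimodal composition. The planner prompt, call_mllm, and the stubs are assumptions made for illustration, not components specified by the survey.

import json

# Planner prompt (illustrative); the doubled braces survive str.format().
PLANNER_PROMPT = (
    "Decide how to answer the user query. Return JSON of the form "
    '{{"action": "vqa" or "retrieve", "rewritten_query": "..."}}. '
    'Choose "vqa" if the attached image alone suffices; otherwise "retrieve".\n'
    "Query: {query}\nImage attached: {has_image}"
)

def call_mllm(prompt, image=None):
    # Hypothetical MLLM call; swap in a real multimodal chat client.
    raise NotImplementedError

def multimodal_retrieve(query):
    # Stub for MRAG2.0-style multimodal retrieval (see the sketch above).
    return []

def compose_multimodal(answer, evidence):
    # Stub for Multimodal Retrieval-Augmented Composition: interleave the
    # textual answer with retrieved images/tables instead of plain text only.
    return {"text": answer, "attachments": evidence}

def plan_and_answer(query, image=None):
    # Planning: one routing decision unifies VQA and RAG, and the query
    # is rewritten for precision before either path runs.
    plan = json.loads(call_mllm(
        PLANNER_PROMPT.format(query=query, has_image=image is not None)))
    if plan["action"] == "vqa" and image is not None:
        # Answerable from the user's own input: plain VQA, no retrieval.
        return {"text": call_mllm(plan["rewritten_query"], image=image),
                "attachments": []}
    evidence = multimodal_retrieve(plan["rewritten_query"])
    answer = call_mllm(plan["rewritten_query"] + "\n\nEvidence:\n"
                       + "\n".join(map(str, evidence)))
    return compose_multimodal(answer, evidence)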

Table of Contents

Coming soon.

Paper List

Coming soon.
