[CVPR 2025] The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"

DAMO-NLP-SG/VideoRefer

VideoRefer can understand any object you're interested in within a video.

📰 News

🚀 Performance

Performance on both region-level image and video benchmarks.

🤗 Huggingface Demo

The online demo (VideoRefer-VideoLLaMA3) is hosted on Huggingface Spaces.

The YouTube video provides a detailed walkthrough of our online demo.

🎥 Video

videorefer-demo.mp4
  • HD video can be viewed on YouTube.

🔍 About VideoRefer Suite

VideoRefer Suite is designed to enhance the fine-grained spatial-temporal understanding capabilities of Video Large Language Models (Video LLMs). It consists of three primary components:

  • Model (VideoRefer)

VideoRefer is an effective Video LLM that enables fine-grained perceiving, reasoning, and retrieval for user-defined regions at any specified timestamps, supporting both single-frame and multi-frame region inputs.

  • Dataset (VideoRefer-700K)

VideoRefer-700K is a large-scale, high-quality object-level video instruction dataset, curated with a multi-agent data engine to fill the gap in high-quality object-level video instruction data.

  • Benchmark (VideoRefer-Bench)

VideoRefer-Bench is a comprehensive benchmark for evaluating a model's object-level video understanding capabilities. It consists of two sub-benchmarks: VideoRefer-Bench-D and VideoRefer-Bench-Q.

🛠️ Requirements and Installation

Basic Dependencies:

  • Python >= 3.8
  • PyTorch >= 2.2.0
  • CUDA Version >= 11.8
  • transformers == 4.40.0 (for reproducing paper results)
  • tokenizers == 0.19.1

Install required packages:

git clone https://github.com/DAMO-NLP-SG/VideoRefer
cd VideoRefer
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
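
After installation, a quick sanity check (a minimal sketch, not specific to VideoRefer) can confirm your environment matches the pinned versions above:

# Quick environment check against the versions listed above (illustrative).
import torch
import transformers
import tokenizers

print("PyTorch:", torch.__version__)               # expected >= 2.2.0
print("CUDA available:", torch.cuda.is_available())
print("CUDA (build):", torch.version.cuda)         # expected >= 11.8
print("transformers:", transformers.__version__)   # 4.40.0 to reproduce paper results
print("tokenizers:", tokenizers.__version__)       # expected 0.19.1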

🌟 Getting started

Please refer to the examples in notebooks for detailed instructions on how to use our model for image and video inference.

  • single-object.ipynb (VideoRefer): demonstrations of single-object understanding in both single-frame and multi-frame mode.
  • multi-object.ipynb (VideoRefer): demonstrations of multi-object question answering in both single-frame and multi-frame mode.
  • image.ipynb (VideoRefer-VideoLLaMA3): demonstrations of image object understanding.
  • video.ipynb (VideoRefer-VideoLLaMA3): demonstrations of video object understanding.
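
The notebooks are the authoritative reference for inference. For orientation only, below is a minimal loading sketch that assumes the Model Zoo checkpoints are hosted under the DAMO-NLP-SG organization on Hugging Face and ship custom modeling code; the exact classes, processor, and chat interface are defined in the notebooks, so treat the names here as assumptions.

# Illustrative sketch: loading a released checkpoint with Hugging Face transformers.
# The repo ID and the use of AutoModelForCausalLM/AutoProcessor are assumptions;
# follow the notebooks for the actual loading and inference code.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "DAMO-NLP-SG/VideoRefer-VideoLLaMA3-7B"  # assumed Hugging Face repo ID

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # the checkpoint ships its own modeling code
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)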

For better usability, the demo integrates SAM2. To get started, please install SAM2 first:

git clone https://github.com/facebookresearch/sam2.git && cd sam2

SAM2_BUILD_CUDA=0 pip install -e ".[notebooks]"

Then, download sam2.1_hiera_large.pt into the checkpoints directory.
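
With the checkpoint in place, region masks for prompting can be produced with SAM2's image predictor, roughly as sketched below (checkpoint and config paths are assumptions based on the sam2 repository layout; adjust to your setup):

# Sketch: generating a binary object mask with SAM2 from a single point prompt.
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"   # config name from the sam2 repo

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
image = np.array(Image.open("frame.jpg").convert("RGB"))

predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[350, 200]]),  # a click on the object of interest
    point_labels=np.array([1]),           # 1 = foreground point
)
mask = masks[np.argmax(scores)]           # best-scoring binary mask for the region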

🌏 Model Zoo

Model Name                  Visual Encoder               Language Decoder
VideoRefer-VideoLLaMA3-7B   VL3-SigLIP-NaViT             Qwen2.5-7B-Instruct
VideoRefer-VideoLLaMA3-2B   VL3-SigLIP-NaViT             Qwen2.5-1.5B-Instruct
VideoRefer-7B               siglip-so400m-patch14-384    Qwen2-7B-Instruct
VideoRefer-7B-stage2        siglip-so400m-patch14-384    Qwen2-7B-Instruct
VideoRefer-7B-stage2.5      siglip-so400m-patch14-384    Qwen2-7B-Instruct

🖨️ VideoRefer-700K

The dataset can be accessed on Hugging Face.

By leveraging our multi-agent data engine, we meticulously create three primary types of object-level video instruction data:

  • Object-level Detailed Caption
  • Object-level Short Caption
  • Object-level QA

Video sources:

Data format:

[
    {
        "video": "videos/xxx.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nWhat is the relationship of <region> and <region>?"
            },
            {
                "from": "gpt",
                "value": "...."
            },
            ...
        ],
        "annotation":[
            // object 1: keys are frame indices (as strings)
            {
                "frame_idx": {
                    "segmentation": {
                        // RLE format or polygon
                    }
                },
                "frame_idx": {
                    "segmentation": {
                        // RLE format or polygon
                    }
                }
            },
            // object 2
            {
                "frame_idx": {
                    "segmentation": {
                        // RLE format or polygon
                    }
                }
            },
            ...
        ]
    }
]
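
Each element of annotation corresponds to one object and maps frame indices to that object's segmentation (COCO-style RLE or polygon). A minimal sketch for decoding the RLE case with pycocotools is shown below; the file name and the polygon handling are assumptions, so check the released files on Hugging Face for the exact layout.

# Sketch: decoding per-frame RLE segmentations of the first object in a sample.
# The JSON file name below is hypothetical; polygon entries need extra handling.
import json
from pycocotools import mask as mask_utils

with open("videorefer_700k_sample.json") as f:
    samples = json.load(f)

first_object = samples[0]["annotation"][0]       # dict: frame index -> {"segmentation": ...}

masks = {}
for frame_idx, ann in first_object.items():
    seg = ann["segmentation"]
    if isinstance(seg, dict):                    # COCO RLE: {"size": [H, W], "counts": ...}
        masks[frame_idx] = mask_utils.decode(seg)
    # polygon-format segmentations would be converted with mask_utils.frPyObjects(...)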

🕹️ VideoRefer-Bench

VideoRefer-Bench assesses models in two key areas: Description Generation, corresponding to VideoRefer-Bench-D, and Multiple-choice Question-Answering, corresponding to VideoRefer-Bench-Q.

videorefer-bench.mp4
  • The annotations of the benchmark can be found in 🤗benchmark.

  • The usage of VideoRefer-Bench is detailed in doc.

  • To evaluate general MLLMs on VideoRefer-Bench, please refer to eval.
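
For the multiple-choice track (VideoRefer-Bench-Q), the headline metric is option accuracy. A minimal scoring sketch under assumed result-field names is shown below; the scripts referenced in doc and eval remain the authoritative evaluation code.

# Sketch: accuracy for VideoRefer-Bench-Q style multiple-choice results.
# The field names "prediction" and "Answer" are assumptions for illustration.
import json

def mcq_accuracy(result_file: str) -> float:
    with open(result_file) as f:
        results = json.load(f)
    correct = sum(
        1 for r in results
        if r["prediction"].strip().upper().startswith(r["Answer"].strip().upper())
    )
    return correct / len(results)

print(f"Accuracy: {mcq_accuracy('videorefer_benchq_results.json'):.2%}")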

📑 Citation

If you find VideoRefer Suite useful for your research and applications, please cite using this BibTeX:

@article{yuan2025videorefersuite,
  title   = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
  author  = {Yuqian Yuan and Hang Zhang and Wentong Li and Zesen Cheng and Boqiang Zhang and Long Li and Xin Li and Deli Zhao and Wenqiao Zhang and Yueting Zhuang and Jianke Zhu and Lidong Bing},
  journal = {arXiv},
  year    = {2025},
  url     = {http://arxiv.org/abs/2501.00599}
}

💡 Some other multimodal-LLM projects from our team may interest you ✨.

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang, Xin Li, Lidong Bing
GitHub | arXiv

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
GitHub | arXiv

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao
GitHub | arXiv

Osprey: Pixel Understanding with Visual Instruction Tuning
Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu
GitHub | arXiv

👍 Acknowledgement

The codebase of VideoRefer is adapted from VideoLLaMA 2 and VideoLLaMA 3.
