
Sight meets Sound: Leveraging Audio for Improved Video Moment Retrieval using Multimodal Large Language Models

  • Authors: Joël Tschesche, Habib Maraqten, Christian Bialas, Ahmed Fourati, Leon Wenderoth

We introduce SMS (Sight meets Sound), a multimodal, single-stage model that extends Chrono (arXiv) with audio-visual reasoning capabilities for Video Moment Retrieval. We achieve new state-of-the-art results on the challenging Charades-STA benchmark and competitive results on QVHighlights.

teaser image

architecture image

Code structure

# data & data preprocessing
./mr_BLIP_data

# pretrained checkpoints from MR.BLIP
./mr_BLIP_checkpoints

# model code
./lavis/

# running scripts for training and inference
./run_scripts

Setup

Install Dependencies

  1. (Optional) Create the conda environment from the provided .yaml file:
cd mr-Audio/envs
conda env create -f SMS.yaml
conda activate sms-env
  2. Or build the environment from source:
conda create -n SMS python=3.8
conda activate SMS
pip install -r requirements.txt
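
To sanity-check the environment, a quick import test can help (a minimal sketch; it assumes you run it from the repo root so the bundled LAVIS code under ./lavis is importable, and that a CUDA build of PyTorch is installed):

# sanity_check.py (hypothetical helper, run from the repo root)
import torch   # should be a CUDA build for multi-GPU training
import lavis   # the model code shipped under ./lavis

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())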

Download Pretrained Models

The SMS checkpoints can be used for fine-tuning and training (on Charades-STA and QVHighlights only). Download the checkpoints and place them under /mr_BLIP_checkpoints.

Dataset Preparation

We test our model on:

  • Charades-STA
  • QVHighlights

Please download the original moment retrieval data and preprocess it with our scripts under ./mr_BLIP_data.
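
After preprocessing, each annotation is essentially a natural-language query paired with a video ID and one or more target windows. A hypothetical entry for illustration only (the actual field names and schema are defined by the scripts under ./mr_BLIP_data):

import json

# Illustrative annotation entry; field names are assumptions, not the real schema.
example = {
    "video_id": "AO8RW",                  # clip identifier
    "query": "person turns on a light",   # natural-language moment query
    "relevant_windows": [[11.2, 17.8]],   # target moment as [start, end] in seconds
}
print(json.dumps(example, indent=2))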

Training and Inference

We provide example SMS training and inference scripts below.

Please refer to the Dataset Preparation section above to customize your data paths.

You might want to update the config files for the respective runs so they fit your machine. They are currently set up for 4 A100-80GB GPUs for Charades-STA and 8 A100-80GB GPUs for QVHighlights. To fit on smaller GPUs, you can reduce the batch size, reduce the number of frames, or apply a frame-level embedding aggregation (32 frame tokens -> 1 token), as sketched below.
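
For example, frame-level aggregation can be as simple as mean-pooling the 32 tokens of each frame into a single token, cutting the sequence length fed to the language model by 32x. A minimal PyTorch sketch of the idea (the tensor shapes and pooling choice are assumptions, not the repo's implementation):

import torch

def aggregate_frame_tokens(frame_embeds: torch.Tensor) -> torch.Tensor:
    """Collapse per-frame token embeddings into one token per frame.

    frame_embeds: (batch, num_frames, tokens_per_frame, dim)
    returns:      (batch, num_frames, 1, dim)
    """
    # Hypothetical aggregation: a simple mean over the token axis.
    return frame_embeds.mean(dim=2, keepdim=True)

# Example: 2 videos, 60 frames, 32 tokens of width 768 each -> 1 token per frame.
x = torch.randn(2, 60, 32, 768)
print(aggregate_frame_tokens(x).shape)  # torch.Size([2, 60, 1, 768])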

1) QVH Finetuning

sh run_scripts/mr_BLIP/train/qvh.sh

2) Charades-STA Finetuning

sh run_scripts/mr_BLIP/train/charades.sh

3) QVH Evaluation

sh run_scripts/mr_BLIP/eval/qvh.sh

Should roughly return:

        R1@0.5   R1@0.7   mAP@0.5   mAP@0.75
SMS     76.39    61.35    69.09     54.07

4) Charades-STA Evaluation

sh run_scripts/mr_BLIP/eval/charades.sh

Should roughly return:

        R1@0.5   R1@0.7   mIoU
SMS     72.54    51.0     60.90
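
For reference, R1@μ counts a prediction as correct when the top-1 predicted window overlaps the ground-truth moment with temporal IoU ≥ μ, and mIoU is the mean of that overlap across all queries. A minimal sketch of the underlying IoU computation (not the repo's evaluation code):

def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) windows in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: this prediction counts as a hit at R1@0.5 but a miss at R1@0.7.
print(round(temporal_iou((11.0, 18.0), (12.0, 20.0)), 3))  # 0.667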

Additional Branches

We also explore different model variations or setups in the following branches:

Acknowledgments

We thank the developers of Chrono, LAVIS, and BLIP-2 for their public code releases.
