Sight meets Sound: Leveraging Audio for Improved Video Moment Retrieval using Multimodal Large Language Models
- Authors: Joël Tschesche, Habib Maraqten, Christian Bialas, Ahmed Fourati, Leon Wenderoth
We introduce SMS (Sight meets Sound), a multimodal, single-stage model that extends Chrono (arXiv) with joint audio-visual reasoning for Video Moment Retrieval. SMS achieves new state-of-the-art results on the challenging Charades-STA benchmark and competitive results on QVHighlights.
# data & data preprocessing
./mr_BLIP_data
# pretrained checkpoints from MR.BLIP
./mr_BLIP_checkpoints
# model code
./lavis/
# running scripts for training and inference
./run_scripts
- (Optional) Creating conda environment from .yaml file
cd mr-Audio/envs
conda env create -f SMS.yaml
conda activate sms-env
- Build from source
conda create -n SMS python=3.8
conda activate SMS
pip install -r requirements.txt
SMS checkpoints can be used for fine-tuning and further training (Charades-STA and QVHighlights only). Download the checkpoints and put them under ./mr_BLIP_checkpoints.
We test our model on Charades-STA and QVHighlights.
Please download the original moment retrieval (MR) data and preprocess it with our scripts.
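Once the data and checkpoints are in place, a quick layout check can catch path mistakes before a long run starts. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and the directory names are simply the ones listed in the code structure of this README.

```python
from pathlib import Path

# Directory names taken from the repository layout listed in this README.
EXPECTED_DIRS = ("mr_BLIP_data", "mr_BLIP_checkpoints", "lavis", "run_scripts")


def missing_dirs(repo_root="."):
    """Return the expected top-level directories that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected directories are present.")
```

Run it from the repository root. It only checks that the directories exist, not that their contents are complete.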
We provide example SMS training and inference scripts below.
Please refer to the dataset page to customize your data paths.
You may need to update the config files for the respective runs to fit your machine. They are currently set up for 4 A100-80GB GPUs (Charades-STA) and 8 A100-80GB GPUs (QVHighlights). To fit on smaller GPUs, you can reduce the batch size, reduce the number of frames, or apply frame-level embedding aggregation (32 frame tokens -> 1 token).
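To illustrate why frame-level aggregation saves memory, here is a minimal sketch that collapses the 32 tokens of one frame into a single token. It assumes mean pooling, which is an assumption for illustration only, since the aggregation used in the repository may differ. Plain Python lists stand in for tensors, and both function names are hypothetical.

```python
def aggregate_frame_tokens(frame_tokens):
    """Mean-pool the token embeddings of one frame into a single embedding.

    frame_tokens: list of tokens, each a list of floats of equal length.
    Mean pooling is an assumption; the repository may aggregate differently.
    """
    if not frame_tokens:
        raise ValueError("a frame needs at least one token")
    dim = len(frame_tokens[0])
    n = len(frame_tokens)
    return [sum(tok[d] for tok in frame_tokens) / n for d in range(dim)]


def visual_sequence_length(num_frames, tokens_per_frame=32, aggregate=False):
    """Number of visual tokens passed on to the language model."""
    return num_frames * (1 if aggregate else tokens_per_frame)


if __name__ == "__main__":
    # Two toy tokens of dimension 2 are averaged into one token.
    print(aggregate_frame_tokens([[1.0, 2.0], [3.0, 4.0]]))
    # Hypothetical clip of 60 frames: sequence length without and with aggregation.
    print(visual_sequence_length(60), visual_sequence_length(60, aggregate=True))
```

For a hypothetical 60-frame clip, the visual sequence shrinks from 60 x 32 = 1920 tokens to 60 tokens, which is where most of the memory saving comes from.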
sh run_scripts/mr_BLIP/train/qvh.sh
sh run_scripts/mr_BLIP/train/charades.sh
sh run_scripts/mr_BLIP/eval/qvh.sh
Should roughly return:
| | R1@0.5 | R1@0.7 | mAP@0.5 | mAP@0.75 |
|---|---|---|---|---|
| SMS | 76.39 | 61.35 | 69.09 | 54.07 |
sh run_scripts/mr_BLIP/eval/charades.sh
Should roughly return:
| | R1@0.5 | R1@0.7 | mIoU |
|---|---|---|---|
| SMS | 72.54 | 51.0 | 60.90 |
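The recall and mIoU figures reported for these benchmarks are built on the temporal IoU between a predicted window and the ground-truth window. The sketch below shows how recall at rank 1 for a given IoU threshold and mIoU are commonly computed. It is not the repository's evaluation code: the function names are hypothetical, it assumes exactly one ground-truth window per query, and it does not cover mAP.

```python
def temporal_iou(pred, gt):
    """IoU of two time windows given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, threshold):
    """Percentage of queries whose top-1 prediction reaches IoU >= threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds, gts):
    """Average temporal IoU over all queries, as a percentage."""
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


if __name__ == "__main__":
    # Toy data: the first prediction is exact, the second overlaps by 5 of 15 seconds.
    preds = [(0.0, 10.0), (5.0, 15.0)]
    gts = [(0.0, 10.0), (10.0, 20.0)]
    print(recall_at_1(preds, gts, 0.5), recall_at_1(preds, gts, 0.7), mean_iou(preds, gts))
```

In the toy data, only the first query clears either threshold, so both recall values are 50.0, while mIoU averages the two overlaps (1.0 and 1/3).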
We also explore different model variations or setups in the following branches:
We thank the developers of Chrono, LAVIS, and BLIP-2 for their public code releases.