- The code for MSR-VTT and MSVD datasets
- The code for video-level LSDOs
- The code for frame-level LSDOs
- The code for LSMDC and DiDeMo datasets
- The code for text-level LSDOs
The official repository for LSDO.
Our model was trained and evaluated using the following package dependencies:
- PyTorch 1.8.0
- Python 3.7.12
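A quick way to confirm that a local environment matches the versions listed above is sketched below. The helper name `version_report` is hypothetical and not part of this repository; it only compares the Python major/minor version and the PyTorch release string, and it assumes that a local build tag such as `+cu111` can be ignored.

```python
import sys
from importlib import import_module

# Versions listed in the dependency list above, used as the reference.
EXPECTED_PYTHON = (3, 7)
EXPECTED_TORCH = "1.8.0"


def version_report(python_version=None, torch_version=None):
    """Return a list of mismatches against the listed versions (empty if none).

    Hypothetical helper for illustration; not part of the repository.
    """
    problems = []
    if python_version is None:
        python_version = tuple(sys.version_info[:2])
    if tuple(python_version[:2]) != EXPECTED_PYTHON:
        found = ".".join(str(part) for part in python_version[:2])
        problems.append("Python %s found, 3.7 expected" % found)

    if torch_version is None:
        try:
            torch_version = import_module("torch").__version__
        except ImportError:
            problems.append("PyTorch is not installed")
            return problems
    # Drop a local build tag such as "+cu111" before comparing (assumption).
    release = str(torch_version).split("+")[0]
    if release != EXPECTED_TORCH:
        problems.append("PyTorch %s found, %s expected" % (release, EXPECTED_TORCH))
    return problems


if __name__ == "__main__":
    for line in version_report() or ["environment matches the listed versions"]:
        print(line)
```

Other versions may also work; the check only reports differences from the environment listed above.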
Our model was trained on the MSR-VTT and MSVD datasets. Download them with the following commands.
# MSR-VTT
wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip
# MSVD
wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msvd_data.zip
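The downloaded files are zip archives, so they need to be unpacked before use. The sketch below shows one way to do that with the standard library. The helper name `extract_archive` and the target directory `data` are assumptions for illustration; the directory layout expected by the training code is not specified here.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Unpack a zip archive into dest_dir and return the names of its members.

    Hypothetical helper for illustration; not part of the repository.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(str(zip_path)) as archive:
        archive.extractall(str(dest))
        return archive.namelist()


if __name__ == "__main__":
    # File names match the wget commands above; "data" is an assumed target.
    for name in ("msrvtt_data.zip", "msvd_data.zip"):
        if Path(name).exists():
            members = extract_archive(name, "data")
            print("%s: extracted %d entries into data/" % (name, len(members)))
        else:
            print("%s: not found, run the wget command first" % name)
```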
We use one NVIDIA RTX 3090 GPU (24 GB) for training. You can train directly with the following commands:
# MSR-VTT-9k
python train.py --exp_name=exp_name --videos_dir=videos_dir --scene_type=average --batch_size=32 --noclip_lr=3e-5 --dataset_name=MSRVTT --msrvtt_train_file=9k
# MSR-VTT-7k
python train.py --exp_name=exp_name --videos_dir=videos_dir --scene_type=average --batch_size=32 --noclip_lr=1e-5 --dataset_name=MSRVTT --msrvtt_train_file=7k
# MSVD
python train.py --exp_name=exp_name --videos_dir=videos_dir --scene_type=average --batch_size=32 --noclip_lr=1e-5 --dataset_name=MSVD
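The three training commands above differ only in the learning rate, the dataset name, and the MSR-VTT split. A minimal sketch of a launcher that keeps those differences in one table is shown below. The names `TRAIN_CONFIGS` and `build_train_command` are hypothetical and not part of this repository; the flags and values are copied from the commands above.

```python
import shlex

# Per-dataset settings copied from the training commands above.
TRAIN_CONFIGS = {
    "MSRVTT-9k": {"dataset_name": "MSRVTT", "noclip_lr": "3e-5", "msrvtt_train_file": "9k"},
    "MSRVTT-7k": {"dataset_name": "MSRVTT", "noclip_lr": "1e-5", "msrvtt_train_file": "7k"},
    "MSVD": {"dataset_name": "MSVD", "noclip_lr": "1e-5"},
}


def build_train_command(config_name, exp_name, videos_dir, batch_size=32):
    """Return the argument list for one of the training commands above.

    Hypothetical helper for illustration; not part of the repository.
    """
    cfg = TRAIN_CONFIGS[config_name]
    args = [
        "python", "train.py",
        "--exp_name=%s" % exp_name,
        "--videos_dir=%s" % videos_dir,
        "--scene_type=average",
        "--batch_size=%d" % batch_size,
        "--noclip_lr=%s" % cfg["noclip_lr"],
        "--dataset_name=%s" % cfg["dataset_name"],
    ]
    # Only the MSR-VTT commands carry a split flag.
    if "msrvtt_train_file" in cfg:
        args.append("--msrvtt_train_file=%s" % cfg["msrvtt_train_file"])
    return args


if __name__ == "__main__":
    for name in TRAIN_CONFIGS:
        command = build_train_command(name, "exp_name", "videos_dir")
        # Quote each argument so the printed line can be pasted into a shell.
        print(" ".join(shlex.quote(arg) for arg in command))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting issues when `videos_dir` contains spaces.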
# MSR-VTT-9k
python train.py --exp_name=exp_name --videos_dir=videos_dir --scene_type=average --batch_size=32 --load_epoch=-1 --dataset_name=MSRVTT --msrvtt_train_file=9k
# MSR-VTT-7k
python train.py --exp_name=exp_name --videos_dir=videos_dir --scene_type=average --batch_size=32 --load_epoch=-1 --dataset_name=MSRVTT --msrvtt_train_file=7k
# MSVD
python train.py --exp_name=exp_name --videos_dir=videos_dir --scene_type=average --batch_size=32 --load_epoch=-1 --dataset_name=MSVD
If you find this work useful in your research, please cite the following paper:
# BibTeX
@ARTICLE{10841928,
author={Zheng, Yanwei and Huang, Bowen and Chen, Zekai and Yu, Dongxiao},
journal={IEEE Transactions on Image Processing},
title={Enhancing Text-Video Retrieval Performance With Low-Salient but Discriminative Objects},
year={2025},
volume={34},
number={},
pages={581-593},
keywords={Feature extraction;Semantics;Transformers;Visualization;Prototypes;Computational modeling;Aggregates;Indexes;Encoding;Context modeling;Text-video retrieval;low-salient but discriminative objects;cross-modal attention},
doi={10.1109/TIP.2025.3527369}}
# GB/T 7714
[1] Zheng Y, Huang B, Chen Z, et al. Enhancing Text-Video Retrieval Performance With Low-Salient but Discriminative Objects[J]. IEEE Transactions on Image Processing, 2025. DOI:10.1109/TIP.2025.3527369.
# MLA
[1] Zheng, Yanwei, et al. "Enhancing Text-Video Retrieval Performance With Low-Salient but Discriminative Objects." IEEE Transactions on Image Processing (2025).
# APA
[1] Zheng, Y., Huang, B., Chen, Z., & Yu, D. (2025). Enhancing text-video retrieval performance with low-salient but discriminative objects. IEEE Transactions on Image Processing.
The codebase is based on X-Pool.