Zhuoyan Luo*1, Yicheng Xiao*1, Yong Liu*12, Yitong Wang2, Yansong Tang1, Xiu Li1, Yujiu Yang1
1 Tsinghua Shenzhen International Graduate School, Tsinghua University 2 ByteDance Inc.
* Equal Contribution
Paper
- We release the code for the 5th Large-scale Video Object Segmentation Challenge.
Recent transformer-based models have dominated the Referring Video Object Segmentation (RVOS) task due to their superior performance. Most prior works adopt a unified DETR framework to generate segmentation masks in a query-to-instance manner. In this work, we integrate the strengths of leading RVOS models to build an effective paradigm. We first obtain binary mask sequences from the RVOS models. To improve the consistency and quality of the masks, we propose a Two-Stage Multi-Model Fusion strategy. Each stage rationally ensembles RVOS models based on their framework design and training strategy, and leverages different video object segmentation (VOS) models to enhance mask coherence through an object propagation mechanism. Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st in Track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).
As we use different RVOS models, we need to set up two versions of the environment.
- install pytorch
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
- install other dependencies
pip install h5py opencv-python protobuf av einops ruamel.yaml timm joblib pandas matplotlib cython scipy
- install transformers
pip install transformers
- install pycocotools
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
- install Pytorch-Correlation (we recommend installing it from source instead of using pip)
- build up MultiScaleDeformableAttention (a quick import check follows this list):
cd soc_test/models/ops
python setup.py build install
- For the other environment, please refer to INSTALL.md for more details.
- Follow each step there to build up the environment.
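After building the ops, a quick way to confirm that the custom CUDA kernel compiled correctly is to try importing it. This is only a minimal sanity check; the module name below follows Deformable-DETR-style builds and may differ in your setup.

```python
# Minimal sanity check for the compiled deformable-attention op.
# Assumption: `setup.py build install` registers the extension under the
# name MultiScaleDeformableAttention (as in Deformable-DETR-style repos).
import torch
import MultiScaleDeformableAttention as MSDA

print("CUDA available:", torch.cuda.is_available())
print("Compiled op loaded from:", MSDA.__file__)
```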
The overall data preparation is organized as follows. We put rvosdata under the path /mnt/data_16TB/lzy23/rvosdata; please change it to xxx/rvosdata.
rvosdata
└── refer_youtube_vos/
    ├── train/
    │   ├── JPEGImages/
    │   │   ├── */ (video folders)
    │   │   └── *.jpg (frame image files)
    │   └── Annotations/
    │       ├── */ (video folders)
    │       └── *.png (mask annotation files)
    ├── valid/
    │   └── JPEGImages/
    │       ├── */ (video folders)
    │       └── *.jpg (frame image files)
    ├── test/
    │   └── JPEGImages/
    │       ├── */ (video folders)
    │       └── *.jpg (frame image files)
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json (text annotations)
        └── valid/
            └── meta_expressions.json (text annotations)
UNINEXT needs the extra valid.json and test.json to be generated for inference; please refer to DATA.md (Ref-Youtube-VOS section).
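Before moving on, it can help to verify that the layout above is in place. Below is a small optional sketch; the root path is the hypothetical xxx/rvosdata used throughout this README.

```python
import os

# Hypothetical data root; replace with your own xxx/rvosdata path.
ROOT = "xxx/rvosdata/refer_youtube_vos"

# Key paths from the directory tree above.
expected = [
    "train/JPEGImages",
    "train/Annotations",
    "valid/JPEGImages",
    "test/JPEGImages",
    "meta_expressions/train/meta_expressions.json",
    "meta_expressions/valid/meta_expressions.json",
]

for rel in expected:
    path = os.path.join(ROOT, rel)
    print(("ok      " if os.path.exists(path) else "MISSING ") + path)
```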
We create a folder to store all pretrained models and put them under /mnt/data_16TB/lzy23/pretrained; please change it to xxx/pretrained.
pretrained
├── pretrained_swin_transformer
├── pretrained_roberta
└── bert-base-uncased
- for the pretrained_swin_transformer folder, download Video-Swin-Base
- for the pretrained_roberta folder, download config.json, pytorch_model.bin, tokenizer.json, and vocab.json from Hugging Face (roberta-base)
- for the bert-base-uncased folder, run:
wget -c https://huggingface.co/bert-base-uncased/resolve/main/config.json
wget -c https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
wget -c https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin
or download them directly from Hugging Face.
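Alternatively, the two text encoders can be cached locally through the transformers API. This is a convenience sketch, not part of the original pipeline; the target folders are the ones listed above (replace xxx with your pretrained root), and newer transformers versions may save the weights as model.safetensors instead of pytorch_model.bin.

```python
from transformers import AutoModel, AutoTokenizer

# Download roberta-base and bert-base-uncased and save them into the
# folders expected by the RVOS models (replace "xxx" with your root).
targets = [
    ("roberta-base", "xxx/pretrained/pretrained_roberta"),
    ("bert-base-uncased", "xxx/pretrained/bert-base-uncased"),
]

for name, folder in targets:
    AutoTokenizer.from_pretrained(name).save_pretrained(folder)
    AutoModel.from_pretrained(name).save_pretrained(folder)
```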
The checkpoints we use are listed below; they are best organized so that each model (backbone) corresponds to its own folder.
| Model | Backbone | Checkpoint |
|---|---|---|
| SOC | Video-Swin-Base | Model |
| MUTR | Video-Swin-Base | Model |
| Referformer_ft | Video-Swin-Base | Model |
| UNINEXT | ViT-H | Model |
| UNINEXT | ConvNeXt | Model |
| AOT | Swin-L | Model |
| DEAOT | Swin-L | Model |
We jointly train the SOC model.
Generally, we put all outputs under one directory. Specifically, we set /mnt/data_16TB/lzy23 as the output directory, so please change it to xxx/.
If you want to jointly train SOC, run the script ./soc_test/train_joint.sh. Before that, you need to change the following paths to match your setup (a small path-rewriting sketch follows this list):
- ./soc_test/configs/refer_youtube.yaml (file)
- text_encoder_type (change /mnt/data_16TB/lzy23 to xxx); the following changes are the same
- ./soc_test/datasets/refer_youtube_vos/
- dataset_path (variable name)
- line 164
- ./soc_test/utils.py
- line 23
- ./soc_test/train_joint.sh
- line 3
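Since the same prefix has to be replaced in several files, a small helper like the sketch below can do the substitution in one pass. The file list and the new prefix are examples only (some entries, such as dataset_path, may be variables rather than this exact prefix), so double-check each file afterwards.

```python
from pathlib import Path

OLD_PREFIX = "/mnt/data_16TB/lzy23"
NEW_PREFIX = "/path/to/your/output/dir"  # your xxx

# Example file list taken from the steps above; extend it as needed.
files = [
    "soc_test/configs/refer_youtube.yaml",
    "soc_test/utils.py",
    "soc_test/train_joint.sh",
]

for f in files:
    p = Path(f)
    text = p.read_text()
    if OLD_PREFIX in text:
        p.write_text(text.replace(OLD_PREFIX, NEW_PREFIX))
        print(f"updated {f}")
    else:
        print(f"no occurrence in {f}")
```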
First, we need to run inference with the checkpoints mentioned above to obtain the Annotations.
SOC
Change the text_encoder path in ./soc_test/configs/refer_youtube_vos.yaml (line 77).
- run the script ./soc_test/scripts/infer_refytb.sh to get the Annotations and key_frame.json; the paths below need to be changed first
- Paths to change (/mnt/data_16TB/lzy23/ -> xxx/)
- ./soc_test/infer_refytb.py
- lines 56, 68
- ./soc_test/scripts/infer_refytb.sh
- lines 3, 4
- run the script ./soc_test/scripts/infer_ensemble_test.sh to get masks.pth for the following ensemble
- Paths to change (/mnt/data_16TB/lzy23/ -> xxx/)
- ./soc_test/infer_refyrb_ensemble.py
- lines 46, 54
- ./soc_test/scripts/infer_ensemble_test.sh
- lines 2, 3
MUTR
Before starting, change the text_encoder path (/mnt/data_16TB/lzy23/ -> xxx/) in ./MUTR/models/mutr.py (line 127).
- run the script ./MUTR/inference_ytvos.sh to obtain the Annotations
- Paths to change (/mnt/data_16TB/lzy23/ -> xxx/)
- ./MUTR/inference_ytvos.sh
- lines 4, 5, 6
- run the script ./MUTR/infer_ytvos_ensemble.sh to obtain mask.pth
- Paths to change (/mnt/data_16TB/lzy23/ -> xxx/)
- ./MUTR/infer_ytvos_ensemble.sh
- lines 4, 5, 6

Then run the command below to generate key_frame.json (change the paths in ptf.py, lines 7, 9, 10):
python3 ./MUTR/ptf.py
Referformer
Before starting, change the text_encoder path (/mnt/data_16TB/lzy23/ -> xxx/) in ./Referformer/models/referformer.py (line 127).
- run the script ./Referformer/infer_ytvos.sh to obtain the Annotations
- Paths to change (/mnt/data_16TB/lzy23/ -> xxx/)
- ./Referformer/inference_ytvos.py
- line 59
- ./Referformer/infer_ytvos.sh
- lines 3, 4
- run the script ./Referformer/scripts/ensemble_for_test.sh
- Paths to change (/mnt/data_16TB/lzy23/ -> xxx/)
- ./Referformer/ensemble_for_test.sh
- lines 5, 9, 10

Then run the command below to generate key_frame.json (change the paths in ptf.py, lines 7, 9, 10):
python3 ./Referformer/ptf.py
UNINEXT
We adopt two different backbones for UNINEXT as our RVOS models, so follow the steps below to get the Annotations and mask.pth. First, change the text encoder path (/mnt/data_16TB/lzy23/ -> xxx/) in:
- ./UNINEXT/projects/UNINEXT/uninext/models/deformable_detr/bert_model.py (lines 17, 19)
- ./UNINEXT/projects/UNINEXT/uninext/data/dataset_mapper_ytbvis.py (line 172)
- ./UNINEXT/projects/UNINEXT/uninext/uninext_vid.py (line 151)

Second, change the image_root and annotations_path in ./UNINEXT/projects/UNINEXT/uninext/data/datasets/ytvis.py (lines 382, 383).
- ViT-H
- run the script ./UNINEXT/assets/infer_huge_rvos.sh
- Paths to change (/mnt/data_16TB/lzy23/ -> xxx/)
- ./UNINEXT/projects/UNINEXT/configs/video_joint_vit_huge.yaml
- lines 4, 51
- ./UNINEXT/detectron2/evaluation/evaluator.py
- line 209 (save_path)

Then run the command below to generate key_frame.json (change the paths in vit_ptf.py, lines 7, 9, 10):
python3 ./UNINEXT/vit_ptf.py
- ConvNeXt
- run the script ./UNINEXT/assets/infer_huge_rvos.sh
- Paths to change (/mnt/data_16TB/lzy23/ -> xxx/)
- ./UNINEXT/projects/UNINEXT/configs/video_joint_convnext_large.yaml
- lines 4, 51
- ./UNINEXT/detectron2/evaluation/evaluator.py
- make sure that you change /mnt/data_16TB/lzy23/test/model_pth/vit_huge.pth to xxx/test/model_pth/convnext.pth

Then run the command below to generate key_frame.json (change the paths in convnext_ptf.py, lines 7, 9, 10):
python3 ./UNINEXT/convnext_ptf.py
After generating all Annotations, the results should be in the following format
test
├── soc/
│   ├── Annotations
│   └── key_frame.json
├── mutr/
│   ├── Annotations
│   └── key_frame.json
├── referformer_ft/
│   ├── Annotations
│   └── key_frame.json
├── vit-huge/
│   ├── Annotations
│   └── key_frame.json
├── convnext/
│   ├── Annotations
│   └── key_frame.json
└── model_pth/
    ├── soc.pth
    ├── mutr.pth
    ├── referformer_ft.pth
    ├── vit_huge.pth
    └── convnext.pth
Then, since the .pth files are quite large and hard to load into memory all at once, run the following command to generate the split .pth files (change the paths in lines 5, 6):
python3 split_pth.py
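For reference, the splitting step conceptually just chunks the saved dictionary and writes each chunk back out. The sketch below illustrates the idea; the real split_pth.py may use a different chunk size or key layout, and the paths are examples.

```python
import os
import torch

# Conceptual sketch: split one large mask file into smaller shards so they
# can be loaded one at a time. Assumes the .pth is a dict keyed by
# video/expression id; the actual structure may differ.
SRC = "xxx/test/model_pth/soc.pth"      # example input
DST_DIR = "xxx/test/model_split/soc"    # example output folder
CHUNK = 50                              # entries per shard

os.makedirs(DST_DIR, exist_ok=True)
data = torch.load(SRC, map_location="cpu")
keys = list(data.keys())
for i in range(0, len(keys), CHUNK):
    shard = {k: data[k] for k in keys[i:i + CHUNK]}
    torch.save(shard, os.path.join(DST_DIR, f"soc{i // CHUNK}.pth"))
```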
We adopt AOT and DEAOT to post-process the mask results.
- AOT
First, change the model_pth path in
- ./rvos_competition/soc_test/AOT/configs/default.py (lines 88, 112, 128, 129)

Then run the following commands:
cd ./soc_test/AOT
bash eval_soc.sh
bash eval_mutr.sh
bash eval_referformer_ft.sh
If you have more GPU resources, you can change the variable gpunum in the sh files.
- DEAOT
Change the model_pth path in
- ./rvos_competition/soc_test/DEAOT/configs/default.py (lines 88, 112, 128, 129)

Then run the following commands:
cd ./soc_test/DEAOT
bash eval_vith.sh
bash eval_convnext.sh
bash eval_referformer_ft.sh
We first fuse three models. Remember to generate all the annotations mentioned above, then run the commands below.
Remember to change the paths in the sh files test_swap_1.sh and test_swap_2.sh (lines 2, 3).
cd ./soc_test/scripts
bash test_swap_1.sh
bash test_swap_2.sh
Afterwards, we use AOT and DEAOT to post-process the fused results respectively: run the script ./soc_test/AOT/eval_soc_mutr_referft.sh and the script ./soc_test/DEAOT/eval_vit_convext_soc.sh.
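For intuition, each fusion step can be thought of as combining the binary masks produced by the selected models pixel by pixel, for example by majority voting. The sketch below only illustrates that idea on a single frame; the actual test_swap and ensemble scripts additionally use the key_frame.json information and model-specific weighting, and the paths are hypothetical.

```python
import numpy as np
from PIL import Image

def fuse_frame(mask_paths, threshold=None):
    """Fuse one frame's binary masks from several models by majority vote.

    mask_paths: paths to per-model PNG masks of the same frame and expression.
    """
    masks = [np.array(Image.open(p)) > 0 for p in mask_paths]
    votes = np.sum(masks, axis=0)
    if threshold is None:
        threshold = len(masks) / 2.0  # simple majority
    return (votes > threshold).astype(np.uint8) * 255

# Hypothetical usage with three models' outputs for the same frame:
# fused = fuse_frame([
#     "soc/Annotations/<video>/<exp>/00000.png",
#     "mutr/Annotations/<video>/<exp>/00000.png",
#     "referformer_ft/Annotations/<video>/<exp>/00000.png",
# ])
```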
Before doing the second ensemble, first make sure the directory layout looks like this:
test
├── soc/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_AOT_class_index
├── mutr/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_AOT_class_index
├── referformer_ft/
│   ├── Annotations
│   ├── key_frame.json
│   ├── Annotations_AOT_class_index
│   └── Annotations_DEAOT_class_index
├── vit-huge/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_DEAOT_class_index
├── convnext/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_DEAOT_class_index
├── soc_mutr_referft/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_AOT_class_index
├── vit_convnext_soc/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_DEAOT_class_index
├── model_pth/
│   ├── soc.pth
│   ├── mutr.pth
│   ├── referformer_ft.pth
│   ├── vit_huge.pth
│   └── convnext.pth
└── model_split/
    ├── soc
    │   ├── soc0.pth
    │   └── xxx
    ├── mutr
    │   ├── mutr0.pth
    │   └── xxx
    ├── referformer_ft.pth
    │   └── referformer_ft0.pth
    ├── vit_huge.pth
    │   └── vit_huge0.pth
    └── convnext.pth
        └── convnext0.pth
We will conduct a two-round ensemble.
- run the script ./soc_test/scripts/test_ensemble_1.sh (change the paths in the sh file, lines 1, 2, 3) to get the en2 Annotations.
- run the script ./soc_test/scripts/test_ensemble_2.sh (also change the paths in the sh file, lines 1, 2, 3) to get the final Annotations.
Finally, the Annotations in the second_ensemble folder named vit_convnext_soc_deaot_vitdeaot_en2_referftdeaot are the submission.
The following table summarizes the Annotations mentioned above.
| Model | Annotations |
|---|---|
| SOC | Origin, AOT |
| MUTR | Origin, AOT |
| Referformer | Origin, AOT, DEAOT |
| ViT-Huge | Origin, DEAOT |
| ConvNeXt | Origin, DEAOT |
| soc_mutr_referft | Origin, AOT |
| vit_convnext_soc | Origin, DEAOT |
| en2 | Annotations |
| Final | Annotations |
The code in this repository is built upon several public repositories. Thanks for their wonderful work.
If you find this work useful for your research, please cite:
@article{SOC,
author = {Zhuoyan Luo and
Yicheng Xiao and
Yong Liu and
Shuyan Li and
Yitong Wang and
Yansong Tang and
Xiu Li and
Yujiu Yang},
title = {{SOC:} Semantic-Assisted Object Cluster for Referring Video Object
Segmentation},
journal = {CoRR},
volume = {abs/2305.17011},
year = {2023},
}