[ICCV 2023 Workshop] The Official Implementation of The First Prize Solution for RVOS Competition

RobertLuo1/iccv2023_RVOS_Challenge

ICCV2023: The 5th Large-scale Video Object Segmentation Challenge

1st place solution for track three: Referring Video Object Segmentation Challenge.

Zhuoyan Luo*1, Yicheng Xiao*1, Yong Liu*12, Yitong Wang2, Yansong Tang1, Xiu Li1, Yujiu Yang1

1 Tsinghua Shenzhen International Graduate School, Tsinghua University 2 ByteDance Inc.

* Equal Contribution

😊😊😊 Paper

📒 Updates:

  • We release the code for the 5th Large-scale Video Object Segmentation Challenge.

📖 Abstract

Recent transformer-based models have dominated the Referring Video Object Segmentation (RVOS) task due to their superior performance. Most prior works adopt a unified DETR framework to generate segmentation masks in a query-to-instance manner. In this work, we integrate the strengths of leading RVOS models to build an effective paradigm. We first obtain binary mask sequences from the RVOS models. To improve the consistency and quality of the masks, we propose a Two-Stage Multi-Model Fusion strategy. Each stage rationally ensembles RVOS models based on framework design as well as training strategy, and leverages different video object segmentation (VOS) models to enhance mask coherence through an object propagation mechanism. Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st in track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).
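For readers unfamiliar with the reported metric: J&F is the average of the region similarity J (the Jaccard index, i.e. intersection over union of predicted and ground-truth masks) and the contour accuracy F. The sketch below only illustrates J and the final averaging on tiny list-based masks. The function names are made up for this example, F is taken as given rather than computed, and the official numbers come from the challenge evaluation, not from this code.

```python
def region_similarity(pred, gt):
    """Jaccard index (intersection over union) of two binary masks.

    Masks are equally sized 2-D lists of 0/1 values.
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks agree perfectly, so J is defined as 1 in that case.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_scores, f_scores):
    """Average the mean J and the mean F over all evaluated objects."""
    mean_j = sum(j_scores) / len(j_scores)
    mean_f = sum(f_scores) / len(f_scores)
    return (mean_j + mean_f) / 2
```

For example, a prediction covering two pixels of which one overlaps a one-pixel ground truth gives J = 1 / 2.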

📗 Framework

🛠️ Environment Setup

Because we use several different RVOS models, two separate environments are needed.

First Environment (for SOC, MUTR, Referformer, AOT and DEAOT)

  • install pytorch
    pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
  • install other dependencies
    pip install h5py opencv-python protobuf av einops ruamel.yaml timm joblib pandas matplotlib cython scipy
  • install transformers
    pip install transformers
  • install pycocotools
    pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
  • install Pytorch Correlation (installing from source is recommended over pip)
  • build up MultiScaleDeformableAttention
    cd soc_test/models/ops
    python setup.py build install
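
After the installation steps above, a quick import check can catch missing packages before any long-running job starts. The following is a minimal sketch, not part of the repository. The import names in the list are assumptions based on the usual names of the pip packages above (for example, opencv-python is imported as cv2). Pytorch Correlation and MultiScaleDeformableAttention are left out because their import names are not stated here.

```python
import importlib.util

# Import names assumed for the pip packages listed in the setup steps.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "h5py", "cv2", "google.protobuf",
    "av", "einops", "ruamel.yaml", "timm", "joblib", "pandas", "matplotlib",
    "Cython", "scipy", "transformers", "pycocotools",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("missing modules:", missing_modules())
```

An empty list means every module in the list can be found. It does not verify versions or CUDA compatibility.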
    

Second Environment (for UNINEXT)

  • Please refer to INSTALL.md for details on this environment
  • Follow each step there to build the environment

Data Preparation

The overall data layout is shown below. We keep rvosdata under /mnt/data_16TB/lzy23/rvosdata; please change this to xxx/rvosdata.

rvosdata
└── refer_youtube_vos/
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files)
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files)
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files)
    ├── test/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files)
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json  (text annotations)
        └── valid/
            └── meta_expressions.json  (text annotations)
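
The layout above can be checked mechanically before training or inference. The sketch below is a hypothetical helper, not a script from the repository. It only tests that the directories and the two meta_expressions.json files from the tree exist under a given rvosdata root, and does not inspect individual videos or frames.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_DIRS = [
    "refer_youtube_vos/train/JPEGImages",
    "refer_youtube_vos/train/Annotations",
    "refer_youtube_vos/valid/JPEGImages",
    "refer_youtube_vos/test/JPEGImages",
]
EXPECTED_FILES = [
    "refer_youtube_vos/meta_expressions/train/meta_expressions.json",
    "refer_youtube_vos/meta_expressions/valid/meta_expressions.json",
]


def missing_entries(rvosdata_root):
    """List the expected directories and files that are absent under the root."""
    root = Path(rvosdata_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

Calling missing_entries("xxx/rvosdata") with your own root should return an empty list once the data is in place.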

UNINEXT needs additional valid.json and test.json files for inference; please refer to DATA.md/Ref-Youtube-VOS to generate them.

Pretrained Model Preparation

We store all pretrained models in one folder, /mnt/data_16TB/lzy23/pretrained; please change this to xxx/pretrained.

pretrained
├── pretrained_swin_transformer
├── pretrained_roberta
└── bert-base-uncased
  • for the pretrained_swin_transformer folder, download Video-Swin-Base
  • for the pretrained_roberta folder, download config.json pytorch_model.bin tokenizer.json vocab.json from huggingface (roberta-base)
  • for the bert-base-uncased folder
wget -c https://huggingface.co/bert-base-uncased/resolve/main/config.json
wget -c https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
wget -c https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin

or download from huggingface

Model_Zoo

The checkpoints we use are listed below. It is best to organize them so that each model (backbone) has its own folder.

| Model | Backbone | Checkpoint |
| --- | --- | --- |
| SOC | Video-Swin-Base | Model |
| MUTR | Video-Swin-Base | Model |
| Referformer_ft | Video-Swin-Base | Model |
| UNINEXT | VIT-H | Model |
| UNINEXT | Convnext | Model |
| AOT | Swin-L | Model |
| DEAOT | Swin-L | Model |

🚀 Training

We jointly train the SOC model.

Output_dir

In general, all outputs are written under one directory. We use /mnt/data_16TB/lzy23 as the output dir, so please change it to xxx/.

To jointly train SOC, run the script ./soc_test/train_joint.sh. Before that, change the following paths to match your setup:

  • ./soc_test/configs/refer_youtube.yaml (file)
    • text_encoder_type (change /mnt/data_16TB/lzy23 to xxx; the same applies to the entries below)
  • ./soc_test/datasets/refer_youtube_vos/
    • dataset_path (variable name)
    • line 164
  • ./soc_test/utils.py
    • line 23
  • ./soc_test/train_joint.sh
    • line 3
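
Since the line numbers listed in this README can drift as files change, it may help to search for the hard-coded prefix directly. The sketch below is a hypothetical helper, not part of the repository. It scans .py, .sh and .yaml files under a repository root for /mnt/data_16TB/lzy23 and reports file, line number and line content. The set of file suffixes is an assumption.

```python
from pathlib import Path

OLD_PREFIX = "/mnt/data_16TB/lzy23"


def find_hardcoded_paths(repo_root, needle=OLD_PREFIX,
                         suffixes=(".py", ".sh", ".yaml")):
    """Return (relative file, line number, stripped line) for every match."""
    root = Path(repo_root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for file_name, lineno, line in find_hardcoded_paths("."):
        print(f"{file_name}:{lineno}: {line}")
```

The helper only reports matches. Editing the paths is left to you so that each change can be reviewed.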

🚀 Testing

First, run inference with the checkpoints listed above to obtain the Annotations.

SOC

change the text_encoder path in ./soc_test/configs/refer_youtube_vos.yaml line 77

  • run the script ./soc_test/scripts/infer_refytb.sh to get the Annotations and key_frame.json; the following paths need to be changed first
    • Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
    • ./soc_test/infer_refytb.py
      • line 56 68
    • ./soc_test/scripts/infer_refytb.sh
      • line 3 4
  • run the script ./soc_test/scripts/infer_ensemble_test.sh to get masks.pth for the subsequent ensemble
    • Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
    • ./soc_test/infer_refyrb_ensemble.py
      • line 46 54
    • ./soc_test/scripts/infer_ensemble_test.sh
      • line 2 3

MUTR

Before starting, change the text_encoder path (/mnt/data_16TB/lzy23/ -> xxx/) in ./MUTR/models/mutr.py line 127

  • run the script ./MUTR/inference_ytvos.sh to obtain the Annotations
    • Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
    • ./MUTR/inference_ytvos.sh
      • line 4 5 6
  • run the script ./MUTR/infer_ytvos_ensemble.sh to obtain mask.pth
    • Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
    • ./MUTR/infer_ytvos_ensemble.sh
      • line 4 5 6
  • run the following command to generate key_frame.json (change the paths in ptf.py line 7 9 10)
python3 ./MUTR/ptf.py

Referformer

Before starting, change the text_encoder path (/mnt/data_16TB/lzy23/ -> xxx/) in ./Referformer/models/referformer.py line 127

  • run the script ./Referformer/infer_ytvos.sh to obtain the Annotations
    • Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
    • ./Referformer/inference_ytvos.py
      • line 59
    • ./Referformer/infer_ytvos.sh
      • line 3 4
  • run the script ./Referformer/scripts/ensemble_for_test.sh
    • Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
    • ./Referformer/ensemble_for_test.sh
      • line 5 9 10
  • run the following command to generate key_frame.json (change the paths in ptf.py line 7 9 10)
python3 ./Referformer/ptf.py

UNINEXT

We adopt two different backbones as our RVOS models, so follow the steps below to get the Annotations and mask.pth.

First, change the text encoder path (/mnt/data_16TB/lzy23/ -> xxx/) in

  • ./UNINEXT/projects/UNINEXT/uninext/models/deformable_detr/bert_model.py line 17 19
  • ./UNINEXT/projects/UNINEXT/uninext/data/dataset_mapper_ytbvis.py line 172
  • ./UNINEXT/projects/UNINEXT/uninext/uninext_vid.py line 151

Second, change the image_root and annotations_path in ./UNINEXT/projects/UNINEXT/uninext/data/datasets/ytvis.py line 382 383

  1. VIT-H
  • run the script ./UNINEXT/assets/infer_huge_rvos.sh
    • Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
    • ./UNINEXT/projects/UNINEXT/configs/video_joint_vit_huge.yaml
      • line 4 51
    • ./UNINEXT/detectron2/evaluation/evaluator.py
      • line 209 save_path
  • run the following command to generate key_frame.json (change the paths in vit_ptf.py line 7 9 10)
python3 ./UNINEXT/vit_ptf.py
  2. Convnext
  • run the script ./UNINEXT/assets/infer_huge_rvos.sh
    • Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
    • ./UNINEXT/projects/UNINEXT/configs/video_joint_convnext_large.yaml
      • line 4 51
    • ./UNINEXT/detectron2/evaluation/evaluator.py
      • make sure that you change /mnt/data_16TB/lzy23/test/model_pth/vit_huge.pth to xxx/test/model_pth/convnext.pth
  • run the following command to generate key_frame.json (change the paths in vit_ptf.py line 7 9 10)
python3 ./UNINEXT/convnext_ptf.py

After generating all Annotations, the results should be organized as follows

test
├── soc/
│   ├── Annotations
│   └── key_frame.json
├── mutr/
│   ├── Annotations
│   └── key_frame.json
├── referformer_ft/
│   ├── Annotations
│   └── key_frame.json
├── vit-huge/
│   ├── Annotations
│   └── key_frame.json
├── convnext/
│   ├── Annotations
│   └── key_frame.json
└── model_pth/
    ├── soc.pth
    ├── mutr.pth
    ├── referformer_ft.pth
    ├── vit_huge.pth
    └── convnext.pth

Because the pth files are very large and hard to load into memory all at once, run the following command to generate the split pth files (change the paths in line 5 6 first)

python3 split_pth.py
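
To illustrate the idea of splitting a large result file into pieces that fit in memory, here is a minimal sketch. It is not the code of split_pth.py, whose internals are not described here. It assumes, purely for illustration, that a loaded pth behaves like a dict with one entry per sample, and it mirrors the soc0.pth style of naming that appears in the model_split folder. Loading and saving (for example with torch.load and torch.save) are left out so the example runs without PyTorch.

```python
def split_into_chunks(entries, chunk_size):
    """Yield (index, sub-dict) pairs holding at most chunk_size entries each.

    `entries` is assumed to be a dict; the real file format may differ.
    """
    keys = list(entries)
    for index, start in enumerate(range(0, len(keys), chunk_size)):
        yield index, {k: entries[k] for k in keys[start:start + chunk_size]}


def chunk_filename(model_name, index):
    """Build a file name such as soc0.pth for chunk 0 of model 'soc'."""
    return f"{model_name}{index}.pth"
```

Each yielded sub-dict would then be written to its own file, so later stages only need to hold one chunk per model in memory at a time.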

Post-Processing

We adopt AOT and DEAOT to post-process the mask results.

  1. AOT

First, change the model_pth path in

  • ./rvos_competition/soc_test/AOT/configs/default.py line 88 112 128 129

then run the following commands
cd ./soc_test/AOT
bash eval_soc.sh
bash eval_mutr.sh
bash eval_referformer_ft.sh

If you have more GPU resources, you can change the variable gpunum in the sh files.

  2. DEAOT

Change the model_pth path in

  • ./rvos_competition/soc_test/DEAOT/configs/default.py line 88 112 128 129

then run the following commands
cd ./soc_test/DEAOT
bash eval_vith.sh
bash eval_convnext.sh
bash eval_referformer_ft.sh

First Round Ensemble

We first fuse three models. Make sure all the annotations mentioned above have been generated, then run the commands below.

Remember to change the paths in test_swap_1.sh and test_swap_2.sh (line 2 3).

cd ./soc_test/scripts
bash test_swap_1.sh
bash test_swap_2.sh

Afterwards, we post-process the fused results with AOT and DEAOT respectively:

  • run the script ./soc_test/AOT/eval_soc_mutr_referft.sh
  • run the script ./soc_test/DEAOT/eval_vit_convext_soc.sh

Second Ensemble

Before running the second ensemble, make sure the outputs are organized as follows

test
├── soc/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_AOT_class_index
├── mutr/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_AOT_class_index
├── referformer_ft/
│   ├── Annotations
│   ├── key_frame.json
│   ├── Annotations_AOT_class_index
│   └── Annotations_DEAOT_class_index
├── vit-huge/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_DEAOT_class_index
├── convnext/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_DEAOT_class_index
├── soc_mutr_referft/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_AOT_class_index
├── vit_convnext_soc/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_DEAOT_class_index
├── model_pth/
│   ├── soc.pth
│   ├── mutr.pth
│   ├── referformer_ft.pth
│   ├── vit_huge.pth
│   └── convnext.pth
└── model_split/
    ├── soc
    │   ├── soc0.pth
    │   └── xxx
    ├── mutr
    │   ├── mutr0.pth
    │   └── xxx
    ├── referformer_ft.pth
    │   └── referformer_ft0.pth
    ├── vit_huge.pth
    │   └── vit_huge0.pth
    └── convnext.pth
        └── convnext0.pth
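
Because the second ensemble depends on many intermediate outputs, a small readiness check can save a failed run. The sketch below is a hypothetical helper, not part of the repository. The mapping is copied from the per-model folders in the tree above; model_pth and model_split are not checked, and the contents of each folder are not inspected.

```python
from pathlib import Path

COMMON = ["Annotations", "key_frame.json"]
AOT = "Annotations_AOT_class_index"
DEAOT = "Annotations_DEAOT_class_index"

# Per-model entries expected under the test folder before the second ensemble.
EXPECTED = {
    "soc": COMMON + [AOT],
    "mutr": COMMON + [AOT],
    "referformer_ft": COMMON + [AOT, DEAOT],
    "vit-huge": COMMON + [DEAOT],
    "convnext": COMMON + [DEAOT],
    "soc_mutr_referft": COMMON + [AOT],
    "vit_convnext_soc": COMMON + [DEAOT],
}


def missing_outputs(test_root, expected=EXPECTED):
    """Return 'model/entry' strings for every expected path that is absent."""
    root = Path(test_root)
    return [
        f"{model}/{entry}"
        for model, entries in expected.items()
        for entry in entries
        if not (root / model / entry).exists()
    ]
```

An empty result from missing_outputs("xxx/test") suggests the per-model outputs are in place.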

We then conduct two rounds of ensembling.

  1. run the script ./soc_test/scripts/test_ensemble_1.sh (change the paths in the sh file, line 1 2 3) to get the en2 Annotations.

  2. run the script ./soc_test/scripts/test_ensemble_2.sh (again change the paths in the sh file, line 1 2 3) to get the final Annotations.

Finally, the Annotations named vit_convnext_soc_deaot_vitdeaot_en2_referftdeaot in the second_ensemble folder are the submission.
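
The post-processing and ensemble scripts have to run in a fixed order, so a small driver with a dry-run mode can help keep track of them. This is a sketch under assumptions, not a script shipped with the repository: the order follows this README, the script names are the ones given above, and running every script from inside its own folder is an assumption (the README only shows the cd step for some of them). Inference and split_pth.py are expected to have finished already.

```python
import subprocess
from pathlib import Path

# (folder relative to the repository root, script), in the order of this README.
STAGES = [
    ("soc_test/AOT", "eval_soc.sh"),
    ("soc_test/AOT", "eval_mutr.sh"),
    ("soc_test/AOT", "eval_referformer_ft.sh"),
    ("soc_test/DEAOT", "eval_vith.sh"),
    ("soc_test/DEAOT", "eval_convnext.sh"),
    ("soc_test/DEAOT", "eval_referformer_ft.sh"),
    ("soc_test/scripts", "test_swap_1.sh"),
    ("soc_test/scripts", "test_swap_2.sh"),
    ("soc_test/AOT", "eval_soc_mutr_referft.sh"),
    ("soc_test/DEAOT", "eval_vit_convext_soc.sh"),
    ("soc_test/scripts", "test_ensemble_1.sh"),
    ("soc_test/scripts", "test_ensemble_2.sh"),
]


def run_stages(repo_root, stages=STAGES, dry_run=True):
    """Return the planned (folder, script) list; execute it unless dry_run."""
    planned = []
    for workdir, script in stages:
        cwd = Path(repo_root) / workdir
        planned.append((cwd.as_posix(), script))
        if not dry_run:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(["bash", script], cwd=cwd, check=True)
    return planned
```

Calling run_stages(".") only lists the plan; pass dry_run=False once all paths inside the scripts have been updated.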

The following table lists the Annotations mentioned above.

| Model | Annotations |
| --- | --- |
| SOC | Origin, AOT |
| MUTR | Origin, AOT |
| Referformer | Origin, AOT, DEAOT |
| Vit-Huge | Origin, DEAOT |
| Convnext | Origin, DEAOT |
| soc_mutr_referft | Origin, AOT |
| vit_convnext_soc | Origin, DEAOT |
| en2 | Annotations |
| Final | Annotations |

Acknowledgement

Code in this repository is built upon several public repositories. Thanks for the wonderful works.

If you find this work useful for your research, please cite:

@article{SOC,
  author       = {Zhuoyan Luo and
                  Yicheng Xiao and
                  Yong Liu and
                  Shuyan Li and
                  Yitong Wang and
                  Yansong Tang and
                  Xiu Li and
                  Yujiu Yang},
  title        = {{SOC:} Semantic-Assisted Object Cluster for Referring Video Object
                  Segmentation},
  journal      = {CoRR},
  volume       = {abs/2305.17011},
  year         = {2023},
}
