
Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation

PyTorch code for the ECCV 2022 paper:

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation,
Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan

Introduction

  • Vision-and-Language Navigation (VLN) is a task in which an agent must follow a language instruction to navigate to a goal position, relying on ongoing interactions with the environment as it moves.

  • Our work explicitly models the history of interactions between observations and the instruction, which is critical for tracking the progress of the navigation trajectory (a rough sketch of this idea follows the list below).

  • Our model, MTVM, achieves a new state of the art on the R2R dataset, reaching a 65% Success Rate and 59% SPL on the unseen test set.
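
The sketch below is only a rough schematic of the variable-length memory idea, not the repository's implementation; all names (VariableLengthMemory, instr_tokens, obs_tokens) are illustrative.

import torch
import torch.nn as nn

class VariableLengthMemory(nn.Module):
    # Schematic only: cache a summary token from each past step and let the
    # transformer attend over the instruction, the whole history, and the
    # current observation together. Names do not come from the MTVM code.
    def __init__(self, dim=768, num_heads=12, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.memory = []  # grows by one token per navigation step

    def forward(self, instr_tokens, obs_tokens):
        # instr_tokens: (B, L_txt, dim); obs_tokens: (B, L_obs, dim)
        if self.memory:
            past = torch.cat(self.memory, dim=1)
        else:
            past = obs_tokens.new_zeros(obs_tokens.size(0), 0, obs_tokens.size(-1))
        fused = self.encoder(torch.cat([instr_tokens, past, obs_tokens], dim=1))
        # Cache a one-token summary of the current step (here: mean-pooled observation).
        self.memory.append(fused[:, -obs_tokens.size(1):].mean(dim=1, keepdim=True).detach())
        return fused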

Results

Requirements

  • Linux or macOS with Python ≥ 3.6
  • PyTorch ≥ 1.6.
pip install -r requirements.txt
sudo apt-get install libjsoncpp-dev libepoxy-dev libglm-dev libosmesa6 libosmesa6-dev libglew-dev

Installation

Build the simulator with the following instructions. The simulator is version v0.1 of the Matterport3D Simulator.

mkdir build && cd build
cmake -DOSMESA_RENDERING=ON ..
make
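
As an optional sanity check (not part of the original instructions), the following Python snippet tries to import and initialize the compiled simulator. It assumes the MatterSim module was built into the build/ directory as above and uses the v0.1 API.

# Hedged smoke test for the Matterport3D Simulator (v0.1) build.
import sys
sys.path.append("build")  # directory containing the compiled MatterSim module

import MatterSim  # an ImportError here usually means the build or OSMesa setup failed

sim = MatterSim.Simulator()
sim.setRenderingEnabled(False)  # no rendering needed for this check
sim.init()
print("MatterSim simulator initialized successfully")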

Prepare datasets

Please follow the data preparation steps from Recurrent VLN-BERT.

R2R Navigation benchmark evaluation and training

The MTVM models are initialized from PREVALENT (indicated by --vlnbert in the train_agent.bash file). Please download the pretrained model and place it under Prevalent/pretrained_model/ before training the MTVM models.
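
A minimal, illustrative check before launching training, assuming the directory layout described above; the flag default and the exact checkpoint filenames are assumptions, not taken from the repository's scripts.

# Illustrative only: verify the PREVALENT checkpoint is in place before training.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--vlnbert", default="prevalent",
                    help="which pretrained VLN-BERT variant to initialize from (assumed default)")
args = parser.parse_args()

ckpt_dir = "Prevalent/pretrained_model/"
if not (os.path.isdir(ckpt_dir) and os.listdir(ckpt_dir)):
    raise FileNotFoundError(
        "Download the PREVALENT pretrained model and place it under "
        + ckpt_dir + " before training the MTVM models.")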

To train a model, run

bash run/train_agent.bash

To evaluate a trained or pretrained model, run

bash run/test_agent.bash

Download the trained network weights here.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@article{lin2021multimodal,
  title={Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation},
  author={Lin, Chuang and Jiang, Yi and Cai, Jianfei and Qu, Lizhen and Haffari, Gholamreza and Yuan, Zehuan},
  journal={arXiv preprint arXiv:2111.05759},
  year={2021}
}

Acknowledgments

This repo is based on Recurrent VLN-BERT. Thanks for their wonderful work.
