# merlot

[MERLOT: Multimodal Neural Script Knowledge Models](https://arxiv.org/abs/2106.02636)

MERLOT is a model for learning what we are calling "neural script knowledge" -- representations about what is going on in videos, spanning multiple video frames with associated captions.

Visit our project page at [rowanzellers.com/merlot](https://rowanzellers.com/merlot), or read the [full paper](https://arxiv.org/abs/2106.02636) to learn more.

## What's here

We are releasing the following:
* Code for the MERLOT model (in [model/](model/), with data processing in [data/](data/))
* Code for running MERLOT over visual story ordering.

We plan to release:
* Information about the videos used in this work
* Code for adapting the model to other tasks (not strictly needed, but just to make things easier)

This is an ongoing effort -- we hope to make it easier to adapt MERLOT to other tasks, so please follow along if interested!

## Environment and setup

There are three different ways of running MERLOT right now:
* **Pretraining on videos**: this requires a TPU pod.
* **Finetuning on downstream tasks**: we did this on TPU v3-8 machines. You can in theory do this on GPUs; however, this isn't tested or officially supported right now.
* **Zero-shot visual story ordering**: we have code for this on a TPU, but you should be able to do it on a GPU too.

```bash
conda create --name merlot python=3.7 && conda activate merlot
conda install -y python=3.7 tqdm numpy pyyaml scipy ipython cython typing h5py pandas

# If running on GPU
pip install tensorflow-gpu==1.15.5
# If running on TPU
pip install tensorflow==1.15.5

pip install --upgrade google-api-python-client oauth2client boto3 cloud-tpu-profiler regex opencv-python-headless Pillow seaborn
pip install numpy==1.17.0
```

### Pretraining from scratch
* First, you'll need to get a bunch of training data in "tfrecord" format and put it in "TRAIN_FILE_PATH". See the data processing code in [data/](data/) for that.
* Next, in the `model` directory, run `python train.py configs/pretrain.yaml`
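
Putting the two steps above together, a launch might look like the following sketch. The bucket path is a hypothetical placeholder -- substitute wherever your tfrecords actually live:

```bash
# Hypothetical path -- point this at your own tfrecord output from data/
export TRAIN_FILE_PATH=gs://your-bucket/tfrecords
cd model
python train.py configs/pretrain.yaml
```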

### Finetuning on downstream tasks
* We used the configuration [model/merlot.yaml](model/merlot.yaml) and the checkpoint at `gs://merlot/checkpoint_4segments/` for downstream task finetuning. This is slightly different from the checkpoint we used for story unshuffling (which we had to adapt to account for the 5 frame-caption segments in that task), but both should work.
* Actual finetuning code is TBD -- you just create a `MerlotModel` ([model/modeling.py](model/modeling.py)), set up your finetuning task (usually involving an additional output layer), and finetune.
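
As a rough pseudocode sketch of that recipe -- `MerlotModel` comes from [model/modeling.py](model/modeling.py), but the task head and loss names here are hypothetical placeholders, not part of the released code:

```
# Pseudocode sketch -- output-layer and loss details depend on your task.
model = MerlotModel(config)            # from model/modeling.py, merlot.yaml config
features = model(frames, captions)     # joint video-text representation
logits = task_output_layer(features)   # the extra layer you add for your task
loss = task_loss(logits, labels)       # e.g. cross-entropy for classification
# ...then train end-to-end from the gs://merlot/checkpoint_4segments/ checkpoint
```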


### Bibtex

```
@article{zellersluhessel2021merlot,
  title={MERLOT: Multimodal Neural Script Knowledge Models},
  author={Zellers, Rowan and Lu, Ximing and Hessel, Jack and Yu, Youngjae and Park, Jae Sung and Cao, Jize and Farhadi, Ali and Choi, Yejin},
  journal={arXiv preprint arXiv:2106.02636},
  year={2021}
}
```