
Commit c85c18b

Author: Rowan Zellers (committed)

Add model code, for both pretraining as well as for story unshuffling

1 parent c02d816 commit c85c18b

34 files changed (+55676, -2 lines)

LICENSE

+1-1
@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2021 Rowan Zellers
+Copyright (c) 2021 Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

README.md

+59-1
@@ -1,2 +1,60 @@
 # merlot
-MERLOT: Multimodal Neural Script Knowledge Models
+[MERLOT: Multimodal Neural Script Knowledge Models](https://arxiv.org/abs/2106.02636)

MERLOT is a model for learning what we are calling "neural script knowledge" -- representations of what is going on in videos, spanning multiple video frames with their associated captions.

Visit our project page at [rowanzellers.com/merlot](https://rowanzellers.com/merlot), or read the [full paper](https://arxiv.org/abs/2106.02636) to learn more.

![teaser](https://i.imgur.com/RD6yb9E.png "teaser")

## What's here

We are releasing the following:
* Code for the MERLOT model (in [model/](model/), with data processing in [data/](data/))
* Code for running MERLOT over visual story ordering.

We plan to release:
* Information about the videos used in this work
* Code for adapting the model to other tasks (not strictly needed, but just to make things easier)

This is ongoing work -- we hope to make it easier to adapt MERLOT to other tasks. Please follow the repo if you're interested!

## Environment and setup

There are three different ways of running MERLOT right now:
* **Pretraining on videos.** This requires a TPU pod.
* **Finetuning on downstream tasks.** We did this on TPU v3-8 machines. You can in theory do this on GPUs; however, this isn't tested or officially supported right now.
* **Zero-shot visual-story ordering.** I have code for this on a TPU, but you should be able to do this on a GPU too.

```bash
conda create --name merlot python=3.7 && conda activate merlot
conda install -y python=3.7 tqdm numpy pyyaml scipy ipython cython typing h5py pandas

# If running on GPU
pip install tensorflow-gpu==1.15.5
# If running on TPU
pip install tensorflow==1.15.5

pip install --upgrade google-api-python-client oauth2client boto3 cloud-tpu-profiler regex opencv-python-headless Pillow seaborn
pip install numpy==1.17.0
```

### Pretraining from scratch
* First, you'll need to get a bunch of training data in tfrecord format, and put it in `TRAIN_FILE_PATH`. See the data processing in [data/](data/) for that.
* Next, in the `model` directory, run `python train.py configs/pretrain.yaml`

### Finetuning on downstream tasks
* We used the configuration [model/merlot.yaml](model/merlot.yaml) and the checkpoint at `gs://merlot/checkpoint_4segments/` for downstream task finetuning. This is slightly different from the checkpoint we used for story unshuffling (which we had to adapt to handle the 5 frame-caption segments in that task), but both should work.
* Actual finetuning code is TBD -- you just create a `MerlotModel` ([model/modeling.py](model/modeling.py)), set up your finetuning task (usually by adding an output layer on top), and finetune. A rough sketch of what this might look like is below.
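
As a very rough illustration (not the official finetuning code), here is a minimal sketch of that pattern against TensorFlow 1.15. `MerlotModel` does live in [model/modeling.py](model/modeling.py), but the constructor arguments and the `pooled_output` attribute used below are assumptions for illustration only -- check `modeling.py` for the real interface.

```python
# Hypothetical finetuning sketch (TensorFlow 1.15). The MerlotModel constructor
# arguments and the `pooled_output` attribute are assumed for illustration;
# see model/modeling.py for the actual interface.
import tensorflow as tf
from model.modeling import MerlotModel  # real module; exact import path assumed

def build_finetuning_graph(config, features, num_classes, is_training=True):
    # Run the pretrained backbone over the video frames + text (input names assumed).
    model = MerlotModel(
        config=config,
        is_training=is_training,
        input_ids=features['input_ids'],
        image_features=features['images'],
    )

    # A single task-specific output layer on top of a pooled representation,
    # i.e. the "additional output layer" mentioned above.
    logits = tf.layers.dense(
        model.pooled_output,
        num_classes,
        kernel_initializer=tf.truncated_normal_initializer(stddev=0.02),
        name='task_output',
    )

    # Standard softmax cross-entropy loss for a classification-style task.
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=features['labels'], logits=logits))
    return logits, loss
```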

### Bibtex

    @article{zellersluhessel2021merlot,
        title={MERLOT: Multimodal Neural Script Knowledge Models},
        author={Zellers, Rowan and Lu, Ximing and Hessel, Jack and Yu, Youngjae and Park, Jae Sung and Cao, Jize and Farhadi, Ali and Choi, Yejin},
        journal={arXiv preprint arXiv:2106.02636},
        year={2021}
    }

data/README.md

+5
@@ -0,0 +1,5 @@
1+
# MERLOT Data for pretraining
2+
3+
* `process.py` contains a quick script for turning the `example_video` into a very small tfrecord for pretraining.
4+
5+
We plan to release the full dataset soon for academic use... stay tuned.
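
For orientation only, here is one plausible way frame/caption segments could be packed into a tfrecord; the feature names (`image/encoded`, `caption`) and overall schema below are assumptions, not necessarily what `process.py` actually writes.

```python
# Hypothetical tfrecord-packing sketch (TensorFlow 1.15). Feature names and
# schema are assumptions; see data/process.py for the actual pretraining format.
import tensorflow as tf

def _bytes_feature(value):
    # Wrap a byte string (e.g. a JPEG-encoded frame or UTF-8 caption) as a tf.train.Feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def write_segments_to_tfrecord(segments, out_path):
    """segments: list of (jpeg_bytes, caption_str) pairs from one video."""
    with tf.io.TFRecordWriter(out_path) as writer:
        for jpeg_bytes, caption in segments:
            example = tf.train.Example(features=tf.train.Features(feature={
                'image/encoded': _bytes_feature(jpeg_bytes),
                'caption': _bytes_feature(caption.encode('utf-8')),
            }))
            writer.write(example.SerializeToString())
```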
