
Commit 2f3a059

Rowan Zellers committed
Share information about downloading MERLOT and pretraining data
1 parent 67b93fd commit 2f3a059

3 files changed: +38, -3 lines


README.md (+3, -1)
@@ -47,7 +47,9 @@ This requires a large TPU pod for data-parallelism.
 * Next, in the `model` directory, run `python train.py configs/merlot.yaml`
 
 ### Finetuning on downstream tasks
-* We used the configuration [model/merlot.yaml](model/merlot.yaml) and the checkpoint at `gs://merlot/checkpoint_4segments/` for downstream task finetuning. This is slightly different than the checkpoint we used for story unshuffling (that we had to adapt to account for the 5 frame-caption segments for that task), but both should work.
+* You can download our checkpoint using [download_checkpoint.py](download_checkpoint.py). There are two options -- we used a checkpoint with 4 frame-caption segments for general-purpose pretraining, and then we trained it for longer (using 5 frame-caption segments) to adapt it to the story ordering task.
+
+We suggest using the *4 segments* checkpoint because that's what we used for all of our finetuning experiments. It corresponds to the configuration at [model/merlot.yaml](model/merlot.yaml).
 * Actual finetuning code TBD -- you just create a `MerlotModel` ([model/modeling.py](model/modeling.py)), set up your finetuning task (usually involving an additional output layer), and finetune.
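Since the finetuning code itself is still TBD, here is a minimal sketch of the pattern that last bullet describes: load the config, build the backbone, and attach a small task head. The `MerlotModel(config)` constructor call, the call signature, the `pooled_output` attribute, and the placeholder `num_classes` are illustrative assumptions, not the repository's actual interface -- check [model/modeling.py](model/modeling.py) for the real one.

```python
# Hypothetical finetuning skeleton -- the MerlotModel interface shown here
# (constructor args, call signature, pooled_output attribute) is assumed,
# not taken from model/modeling.py.
import yaml
import tensorflow as tf
from model.modeling import MerlotModel

with open('model/merlot.yaml') as f:
    config = yaml.safe_load(f)

num_classes = 2                                 # placeholder for your downstream task
backbone = MerlotModel(config)                  # assumed: built from the YAML config
task_head = tf.keras.layers.Dense(num_classes)  # the "additional output layer"

def task_logits(video_frames, text_tokens):
    # Assumed: the backbone returns a pooled joint representation of the
    # frame-caption segments; the real attribute name may differ.
    features = backbone(video_frames, text_tokens).pooled_output
    return task_head(features)

# Finetune end-to-end on a task-specific loss over these logits, initializing
# the backbone from the downloaded 4-segment checkpoint.
```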

data/README.md (+6, -2)
@@ -1,5 +1,9 @@
-# MERLOT Data for pretraining
+# MERLOT Data for pretraining (YT-Temporal-180M)
 
 * `process.py` contains a quick script for turning the `example_video` into a very small tfrecord for pretraining.
+* The dataset is available for academic use; please contact Rowan for access. We probably cannot release the videos themselves (for legal reasons and to protect privacy). What we are releasing are annotations that look like this:
 
-We plan to release the full dataset soon for academic use... stay tuned.
+* `denoised`: a list of spans of `noisyasr` text that were cleaned up with a finetuned Grover model (the output is `cleanasr`). The perplexity of the context is under `ctx_ppl`.
+* `info`: a dictionary with information about the YouTube video.
+* `subtitles`: each word, along with the approximate timestamp of when it was said in the video.
+* `_te`: time elapsed (this isn't needed at all).
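To make the annotation layout above concrete, here is a small sketch of reading one record. It assumes the annotations ship as JSON lines with exactly the keys listed above; the filename and the per-span field names inside `denoised` (`noisyasr`, `cleanasr`, `ctx_ppl`) follow the description, but the exact serialization is an assumption.

```python
# Sketch of reading one YT-Temporal-180M annotation record.
# Assumes a JSON-lines file; the actual release format may differ.
import json

with open('yttemporal180m_annotations.jsonl') as f:  # hypothetical filename
    record = json.loads(next(f))

print(record['info'])              # metadata about the YouTube video
for span in record['denoised']:    # cleaned-up ASR spans
    print(span.get('noisyasr'), '->', span.get('cleanasr'), span.get('ctx_ppl'))
for word in record['subtitles']:   # each word with an approximate timestamp
    print(word)
# record['_te'] is just time elapsed and can be ignored.
```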

download_checkpoint.py (+29)
@@ -0,0 +1,29 @@
+"""
+Downloads a MERLOT checkpoint.
+Pass -use_5_segments if you want the 5-segment checkpoint (used for story ordering); otherwise we use the 4-segment checkpoint (used for everything else).
+"""
+
+import os
+import requests
+import argparse
+
+parser = argparse.ArgumentParser(description='Download MERLOT!')
+parser.add_argument(
+    '-use_5_segments',
+    dest='use_5_segments',
+    action='store_true'
+)
+use_5_segments = parser.parse_args().use_5_segments
+nseg = 5 if use_5_segments else 4
+model_dir = os.path.join('checkpoints', f'checkpoint_{nseg}segments/')
+if not os.path.exists(model_dir):
+    os.makedirs(model_dir)
+
+for ext in ['data-00000-of-00001', 'index', 'meta', 'checkpoint']:
+    r = requests.get(f'https://storage.googleapis.com/merlot/checkpoint_{nseg}segments/model.ckpt.{ext}', stream=True)
+    with open(os.path.join(model_dir, f'model.ckpt.{ext}'), 'wb') as f:
+        file_size = int(r.headers["content-length"])
+        chunk_size = min(1000, file_size)  # stream the download in small chunks
+        for chunk in r.iter_content(chunk_size=chunk_size):
+            f.write(chunk)
+    print(f"Just downloaded merlot/checkpoint_{nseg}segments/model.ckpt.{ext}!", flush=True)
