
Commit 2f3a059

Rowan Zellers committed
Share information about downloading MERLOT and pretraining data
1 parent 67b93fd commit 2f3a059

3 files changed: +38, -3 lines


README.md (+3, -1)
@@ -47,7 +47,9 @@ This requires a large TPU pod for data-parallelism.
 * Next, in the `model` directory, run `python train.py configs/merlot.yaml`
 
 ### Finetuning on downstream tasks
-* We used the configuration [model/merlot.yaml](model/merlot.yaml) and the checkpoint at `gs://merlot/checkpoint_4segments/` for downstream task finetuning. This is slightly different than the checkpoint we used for story unshuffling (that we had to adapt to account for the 5 frame-caption segments for that task), but both should work.
+* You can download our checkpoint using [download_checkpoint.py](download_checkpoint.py). There are two options -- we used a checkpoint with 4 frame-caption segments for general-purpose pretraining, and then we trained it for longer (using 5 frame-caption segments) to adapt it to the story ordering task.
+
+We suggest using the *4 segments* checkpoint because that's what we used for all of our finetuning experiments. It corresponds to the configuration at [model/merlot.yaml](model/merlot.yaml).
 * Actual finetuning code TBD -- you just create a `MerlotModel` ([model/modeling.py](model/modeling.py)), set up your finetuning task (usually involving an additional output layer), and finetune.
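Since the finetuning code itself is still TBD, here is a minimal sketch of the pattern that last bullet describes: load the config, build the backbone, and attach a small task head. The `MerlotModel(config)` constructor call, the call signature, the `pooled_output` attribute, and the placeholder `num_classes` are illustrative assumptions, not the repository's actual interface -- check [model/modeling.py](model/modeling.py) for the real one.

```python
# Hypothetical finetuning skeleton -- the MerlotModel interface shown here
# (constructor args, call signature, pooled_output attribute) is assumed,
# not taken from model/modeling.py.
import yaml
import tensorflow as tf
from model.modeling import MerlotModel

with open('model/merlot.yaml') as f:
    config = yaml.safe_load(f)

num_classes = 2                                 # placeholder for your downstream task
backbone = MerlotModel(config)                  # assumed: built from the YAML config
task_head = tf.keras.layers.Dense(num_classes)  # the "additional output layer"

def task_logits(video_frames, text_tokens):
    # Assumed: the backbone returns a pooled joint representation of the
    # frame-caption segments; the real attribute name may differ.
    features = backbone(video_frames, text_tokens).pooled_output
    return task_head(features)

# Finetune end-to-end on a task-specific loss over these logits, initializing
# the backbone from the downloaded 4-segment checkpoint.
```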

data/README.md (+6, -2)
@@ -1,5 +1,9 @@
-# MERLOT Data for pretraining
+# MERLOT Data for pretraining (YT-Temporal-180M)
 
 * `process.py` contains a quick script for turning the `example_video` into a very small tfrecord for pretraining.
+* The dataset is available for academic use; please contact Rowan for access. We probably cannot release the videos themselves (for legal reasons and to protect privacy). What we are releasing are annotations that look like this:
 
-We plan to release the full dataset soon for academic use... stay tuned.
+* `denoised`: a list of spans of `noisyasr` text that were cleaned up with a finetuned Grover model (the output is `cleanasr`). The perplexity of the context is under `ctx_ppl`.
+* `info`: a dictionary with information about the YouTube video.
+* `subtitles`: each word, along with the approximate timestamp of when it was said in the video.
+* `_te`: time elapsed (this isn't needed at all).
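To make the annotation layout above concrete, here is a small sketch of reading one record. It assumes the annotations ship as JSON lines with exactly the keys listed above; the filename and the per-span field names inside `denoised` (`noisyasr`, `cleanasr`, `ctx_ppl`) follow the description, but the exact serialization is an assumption.

```python
# Sketch of reading one YT-Temporal-180M annotation record.
# Assumes a JSON-lines file; the actual release format may differ.
import json

with open('yttemporal180m_annotations.jsonl') as f:  # hypothetical filename
    record = json.loads(next(f))

print(record['info'])              # metadata about the YouTube video
for span in record['denoised']:    # cleaned-up ASR spans
    print(span.get('noisyasr'), '->', span.get('cleanasr'), span.get('ctx_ppl'))
for word in record['subtitles']:   # each word with an approximate timestamp
    print(word)
# record['_te'] is just time elapsed and can be ignored.
```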

download_checkpoint.py (+29)
@@ -0,0 +1,29 @@
+"""
+Downloads a MERLOT checkpoint.
+Pass -use_5_segments if you want the 5-segment checkpoint (used for story ordering); otherwise we use the 4-segment checkpoint (used for everything else).
+"""
+
+import os
+import requests
+import argparse
+
+parser = argparse.ArgumentParser(description='Download MERLOT!')
+parser.add_argument(
+    '-use_5_segments',
+    dest='use_5_segments',
+    action='store_true'
+)
+use_5_segments = parser.parse_args().use_5_segments
+nseg = 5 if use_5_segments else 4
+model_dir = os.path.join('checkpoints', f'checkpoint_{nseg}segments/')
+if not os.path.exists(model_dir):
+    os.makedirs(model_dir)
+
+for ext in ['data-00000-of-00001', 'index', 'meta', 'checkpoint']:
+    r = requests.get(f'https://storage.googleapis.com/merlot/checkpoint_{nseg}segments/model.ckpt.{ext}', stream=True)
+    with open(os.path.join(model_dir, f'model.ckpt.{ext}'), 'wb') as f:
+        file_size = int(r.headers["content-length"])
+        chunk_size = min(1000, file_size)  # stream the download in small chunks
+        for chunk in r.iter_content(chunk_size=chunk_size):
+            f.write(chunk)
+    print(f"Just downloaded merlot/checkpoint_{nseg}segments/model.ckpt.{ext}!", flush=True)
