Update READMEs for torch.hub

Summary:
Pull Request resolved: fairinternal/fairseq-py#795
Differential Revision: D16620488
Pulled By: myleott
fbshipit-source-id: 1998a9ccd8816fc7f590861fb4898f910a36bc1e
LLL-Orleans · Aug 2, 2019 · abb7ed4
1 parent 5f34252
Showing 14 changed files with 530 additions and 556 deletions.
diff --git a/examples/backtranslation/README.md b/examples/backtranslation/README.md
@@ -4,29 +4,32 @@ This page includes pre-trained models from the paper [Understanding Back-Transla
 
 ## Pre-trained models
 
-Description | Dataset | Model | Test set(s)
+Model | Description | Dataset | Download
 ---|---|---|---
-Transformer <br> ([Edunov et al., 2018](https://arxiv.org/abs/1808.09381); WMT'18 winner) | [WMT'18 English-German](http://www.statmt.org/wmt18/translation-task.html) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz) | See NOTE in the archive
+`transformer.wmt18.en-de` | Transformer <br> ([Edunov et al., 2018](https://arxiv.org/abs/1808.09381)) <br> WMT'18 winner | [WMT'18 English-German](http://www.statmt.org/wmt18/translation-task.html) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz) <br> See NOTE in the archive
 
 ## Example usage
 
 Interactive generation from the full ensemble via PyTorch Hub:
-```
->>> import torch
->>> torch.hub.list('pytorch/fairseq')
-[..., 'transformer.wmt14.en-fr', 'transformer.wmt16.en-de', 'transformer.wmt18.en-de', ... ]
->>> en2de_ensemble = torch.hub.load(
-...   'pytorch/fairseq',
-...   'transformer.wmt18.en-de',
-...   checkpoint_file='wmt18.model1.pt:wmt18.model2.pt:wmt18.model3.pt:wmt18.model4.pt:wmt18.model5.pt',
-...   data_name_or_path='.',
-...   tokenizer='moses',
-...   bpe='subword_nmt',
-... )
->>> len(en2de_ensemble.models)
-5
->>> print(en2de_ensemble.generate('Hello world!'))
-Hallo Welt!
+```python
+import torch
+
+# List available models
+torch.hub.list('pytorch/fairseq')  # [..., 'transformer.wmt18.en-de', ... ]
+
+# Load the WMT'18 En-De ensemble
+en2de_ensemble = torch.hub.load(
+    'pytorch/fairseq', 'transformer.wmt18.en-de',
+    checkpoint_file='wmt18.model1.pt:wmt18.model2.pt:wmt18.model3.pt:wmt18.model4.pt:wmt18.model5.pt',
+    tokenizer='moses', bpe='subword_nmt')
+
+# The ensemble contains 5 models
+len(en2de_ensemble.models)
+# 5
+
+# Translate
+en2de_ensemble.translate('Hello world!')
+# 'Hallo Welt!'
 ```
 
 ## Citation

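Note on the hunk above: the `checkpoint_file` argument names the five ensemble members as a single colon-separated string. The following is a minimal sketch of building that string programmatically instead of typing it out. The helper name `ensemble_checkpoint_arg` is hypothetical and is not part of fairseq or torch.hub; only the file names and the `:` separator come from the README.

```python
def ensemble_checkpoint_arg(prefix, count, sep=":"):
    """Join numbered checkpoint names (prefix1.pt ... prefixN.pt) with `sep`.

    Hypothetical helper for illustration; the naming pattern is taken
    from the README example above.
    """
    return sep.join(f"{prefix}{i}.pt" for i in range(1, count + 1))


# Reproduces the string passed as checkpoint_file in the README example.
checkpoint_file = ensemble_checkpoint_arg("wmt18.model", 5)
print(checkpoint_file)
```

The resulting string can be passed as `checkpoint_file=...` in the `torch.hub.load` call shown in the README.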
diff --git a/examples/language_model/README.md b/examples/language_model/README.md
@@ -2,36 +2,30 @@
 
 ## Pre-trained models
 
-Description | Parameters | Dataset | Model and Test set(s)
----|---:|---|---
-Adaptive Inputs <br> ([Baevski and Auli, 2018](https://arxiv.org/abs/1809.10853)) | 1026M | [Google Billion Words](https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/lm/adaptive_lm_gbw_huge.tar.bz2)
-Adaptive Inputs <br> ([Baevski and Auli, 2018](https://arxiv.org/abs/1809.10853)) | 247M | [WikiText-103](https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/lm/adaptive_lm_wiki103.tar.bz2)
-
+Model | Description | Dataset | Download
+---|---|---|---
+`transformer_lm.gbw.adaptive_huge` | Adaptive Inputs <br> ([Baevski and Auli, 2018](https://arxiv.org/abs/1809.10853)) <br> 1026M params | [Google Billion Words](https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/lm/adaptive_lm_gbw_huge.tar.bz2)
+`transformer_lm.wiki103.adaptive` | Adaptive Inputs <br> ([Baevski and Auli, 2018](https://arxiv.org/abs/1809.10853)) <br> 247M params | [WikiText-103](https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/lm/adaptive_lm_wiki103.tar.bz2)
+`transformer_lm.wmt19.en` | English LM <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) | [WMT News Crawl](http://data.statmt.org/news-crawl/) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.en.tar.gz)
+`transformer_lm.wmt19.de` | German LM <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) | [WMT News Crawl](http://data.statmt.org/news-crawl/) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.de.tar.gz)
+`transformer_lm.wmt19.ru` | Russian LM <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) | [WMT News Crawl](http://data.statmt.org/news-crawl/) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.ru.tar.gz)
 
 ## Example usage
 
-Interactive generation via PyTorch Hub:
-```
->>> import torch
->>> torch.hub.list('pytorch/fairseq')
-[..., 'transformer_lm.gbw.adaptive_huge', 'transformer_lm.wiki103.adaptive', ...]
->>> lm = torch.hub.load(
-...   'pytorch/fairseq',
-...   'transformer_lm.wiki103.adaptive',
-...   data_name_or_path='./data-bin',
-...   tokenizer='moses',
-...   no_escape=True,
-...   beam=1,
-...   sampling=True,
-...   sampling_topk=10,
-...   temperature=0.8,
-... )
->>> lm.generate('Barack Obama', verbose=True)
-```
+Sampling from a language model using PyTorch Hub:
+```python
+import torch
 
-Available models are listed in the ``hub_models()`` method in each model file, for example:
-[transformer_lm.py](https://github.com/pytorch/fairseq/blob/master/fairseq/models/transformer_lm.py).
+# List available models
+torch.hub.list('pytorch/fairseq')  # [..., 'transformer_lm.wmt19.en', ...]
 
+# Load an English LM trained on WMT'19 News Crawl data
+en_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.en', tokenizer='moses', bpe='fastbpe')
+
+# Sample from the language model
+en_lm.sample('Barack Obama', beam=1, sampling=True, sampling_topk=10, temperature=0.8)
+# "Barack Obama is coming to Sydney and New Zealand (...)"
+```
 
 ## Training a new model with the CLI tools
 
@@ -44,47 +38,47 @@ Provides an example of pre-processing for [WikiText-103 language modeling task](
 Example usage:
 
 Prepare data:
-```
-$ cd examples/language_model/
-$ bash prepare-wikitext-103.sh
-$ cd ../..
+```bash
+cd examples/language_model/
+bash prepare-wikitext-103.sh
+cd ../..
 
 # Binarize the dataset:
-$ TEXT=examples/language_model/wikitext-103
+TEXT=examples/language_model/wikitext-103
 
-$ fairseq-preprocess --only-source \
-  --trainpref $TEXT/wiki.train.tokens --validpref $TEXT/wiki.valid.tokens --testpref $TEXT/wiki.test.tokens \ 
-  --destdir data-bin/wikitext-103
+fairseq-preprocess --only-source \
+    --trainpref $TEXT/wiki.train.tokens --validpref $TEXT/wiki.valid.tokens --testpref $TEXT/wiki.test.tokens \
+    --destdir data-bin/wikitext-103
 ```
 
 Train a transformer language model with adaptive inputs ([Baevski and Auli (2018): Adaptive Input Representations for Neural Language Modeling](transformer_lm/README.md)):
-```
+```bash
 # If it runs out of memory, try to reduce max-tokens and tokens-per-sample
-$ mkdir -p checkpoints/transformer_wikitext-103
-$ fairseq-train --task language_modeling data-bin/wikitext-103 \
-  --save-dir checkpoints/transformer_wikitext-103 --arch transformer_lm_wiki103 \
-  --max-update 286000 --max-lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
-  --warmup-updates 16000 --warmup-init-lr 1e-07 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
-  --criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \
-  --sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d
+mkdir -p checkpoints/transformer_wikitext-103
+fairseq-train --task language_modeling data-bin/wikitext-103 \
+    --save-dir checkpoints/transformer_wikitext-103 --arch transformer_lm_wiki103 \
+    --max-update 286000 --max-lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
+    --warmup-updates 16000 --warmup-init-lr 1e-07 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
+    --criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \
+    --sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d
 
 # Evaluate:
-$ fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/transformer_wiki103/checkpoint_best.pt' \
-  --sample-break-mode complete --max-tokens 3072 --context-window 2560 --softmax-batch 1024
+fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/transformer_wikitext-103/checkpoint_best.pt' \
+    --sample-break-mode complete --max-tokens 3072 --context-window 2560 --softmax-batch 1024
 ```
 
 Train a convolutional language model ([Dauphin et al. (2017): Language Modeling with Gated Convolutional Networks](conv_lm/README.md)):
 ```
 # If it runs out of memory, try to reduce max-tokens and tokens-per-sample
-$ mkdir -p checkpoints/fconv_wikitext-103
-$ fairseq-train --task language_modeling data-bin/wikitext-103 \
-  --save-dir checkpoints/fconv_wikitext-103 \
-  --max-epoch 35 --arch fconv_lm_dauphin_wikitext103 --optimizer nag \
-  --lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
-  --clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion adaptive_loss \
-  --adaptive-softmax-cutoff 10000,20000,200000 --max-tokens 1024 --tokens-per-sample 1024 \
-  --ddp-backend=no_c10d
+mkdir -p checkpoints/fconv_wikitext-103
+fairseq-train --task language_modeling data-bin/wikitext-103 \
+    --save-dir checkpoints/fconv_wikitext-103 \
+    --max-epoch 35 --arch fconv_lm_dauphin_wikitext103 --optimizer nag \
+    --lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
+    --clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion adaptive_loss \
+    --adaptive-softmax-cutoff 10000,20000,200000 --max-tokens 1024 --tokens-per-sample 1024 \
+    --ddp-backend=no_c10d
 
 # Evaluate:
-$ fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/fconv_wiki103/checkpoint_best.pt'
+fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/fconv_wikitext-103/checkpoint_best.pt'
 ```
diff --git a/examples/roberta/README.finetune_custom_classification.md b/examples/roberta/README.finetune_custom_classification.md
@@ -1,14 +1,16 @@
-# RoBERTa fine-tuning on custom classification task (example IMDB)
+# Finetuning RoBERTa on a custom classification task
 
-## 1) Get the data
-```
+This example shows how to finetune RoBERTa on the IMDB dataset, but should illustrate the process for most classification tasks.
+
+### 1) Get the data
+```bash
 wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
 tar zxvf aclImdb_v1.tar.gz
 ```
 
-## 2) Format data
+### 2) Format data
 `IMDB` data has one data-sample in each file, below python code-snippet converts it one file for train and valid each for ease of processing.  
-```
+```python
 import argparse
 import os
 import random
@@ -42,79 +44,78 @@ if __name__ == '__main__':
     main(args)
 ```
 
-## 3) BPE Encode
+### 3) BPE Encode
 Run `multiprocessing_bpe_encoder`, you can also do this in previous step for each sample but that might be slower.
-```
+```bash
 # Download encoder.json and vocab.bpe
 wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
 wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
 
-for SPLIT in train dev;
-do
-  python -m examples.roberta.multiprocessing_bpe_encoder \
-  --encoder-json encoder.json \
-  --vocab-bpe vocab.bpe \
-  --inputs "aclImdb/$SPLIT.input0" \
-  --outputs "aclImdb/$SPLIT.input0.bpe" \
-  --workers 60 \
-  --keep-empty;
+for SPLIT in train dev; do
+    python -m examples.roberta.multiprocessing_bpe_encoder \
+        --encoder-json encoder.json \
+        --vocab-bpe vocab.bpe \
+        --inputs "aclImdb/$SPLIT.input0" \
+        --outputs "aclImdb/$SPLIT.input0.bpe" \
+        --workers 60 \
+        --keep-empty
 done
 ```
 
+### 4) Preprocess data
 
-## 4) Preprocess data
-
-```
+```bash
 # Download fairseq dictionary.
 wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'  
 
 fairseq-preprocess \
-  --only-source \
-  --trainpref "aclImdb/train.input0.bpe" \
-  --validpref "aclImdb/dev.input0.bpe" \
-  --destdir "IMDB-bin/input0" \
-  --workers 60 \
-  --srcdict dict.txt;
+    --only-source \
+    --trainpref "aclImdb/train.input0.bpe" \
+    --validpref "aclImdb/dev.input0.bpe" \
+    --destdir "IMDB-bin/input0" \
+    --workers 60 \
+    --srcdict dict.txt
 
 fairseq-preprocess \
-  --only-source \
-  --trainpref "aclImdb/train.label" \
-  --validpref "aclImdb/dev.label" \
-  --destdir "IMDB-bin/label" \
-  --workers 60;
+    --only-source \
+    --trainpref "aclImdb/train.label" \
+    --validpref "aclImdb/dev.label" \
+    --destdir "IMDB-bin/label" \
+    --workers 60
 
 ```
 
-## 5) Run Training
+### 5) Run Training
 
-```
+```bash
 TOTAL_NUM_UPDATES=7812  # 10 epochs through IMDB for bsz 32
 WARMUP_UPDATES=469      # 6 percent of the number of updates
 LR=1e-05                # Peak LR for polynomial LR scheduler.
 NUM_CLASSES=2
 MAX_SENTENCES=8        # Batch size.
+ROBERTA_PATH=/path/to/roberta/model.pt
 
 CUDA_VISIBLE_DEVICES=0 python train.py IMDB-bin/ \
---restore-file <roberta_large_absolute_path> \
---max-positions 512 \
---max-sentences $MAX_SENTENCES \
---max-tokens 4400 \
---task sentence_prediction \
---reset-optimizer --reset-dataloader --reset-meters \
---required-batch-size-multiple 1 \
---init-token 0 --separator-token 2 \
---arch roberta_large \
---criterion sentence_prediction \
---num-classes $NUM_CLASSES \
---dropout 0.1 --attention-dropout 0.1 \
---weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
---clip-norm 0.0 \
---lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
---fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
---max-epoch 10 \
---best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
---truncate-sequence \
---update-freq 4;
+    --restore-file $ROBERTA_PATH \
+    --max-positions 512 \
+    --max-sentences $MAX_SENTENCES \
+    --max-tokens 4400 \
+    --task sentence_prediction \
+    --reset-optimizer --reset-dataloader --reset-meters \
+    --required-batch-size-multiple 1 \
+    --init-token 0 --separator-token 2 \
+    --arch roberta_large \
+    --criterion sentence_prediction \
+    --num-classes $NUM_CLASSES \
+    --dropout 0.1 --attention-dropout 0.1 \
+    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
+    --clip-norm 0.0 \
+    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
+    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
+    --max-epoch 10 \
+    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
+    --truncate-sequence \
+    --update-freq 4
 ```
 Above will train with effective batch-size of `32`, tested on one `Nvidia V100 32gb`.
 Expected `best-validation-accuracy` after `10` epochs is `~96.5%`.
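Note on the training settings above: the comments in the script relate `MAX_SENTENCES`, `--update-freq`, `TOTAL_NUM_UPDATES` and `WARMUP_UPDATES`, but do not show the arithmetic. The sketch below is one reading of those comments, not code from fairseq. It assumes the IMDB training split holds 25,000 reviews (a figure not stated in the README), and the function name `finetune_schedule` is hypothetical.

```python
def finetune_schedule(num_examples, max_sentences, update_freq, epochs, warmup_frac=0.06):
    """Derive schedule values from dataset size and batch settings.

    Assumption: total updates = examples * epochs / effective batch size,
    rounded down, and warmup = 6 percent of that, rounded to nearest.
    """
    effective_bsz = max_sentences * update_freq          # per-GPU batch x accumulation steps
    total_updates = num_examples * epochs // effective_bsz
    warmup_updates = round(total_updates * warmup_frac)
    return effective_bsz, total_updates, warmup_updates


# Assumed: 25,000 IMDB training examples; other values are from the script above.
bsz, total, warmup = finetune_schedule(25000, max_sentences=8, update_freq=4, epochs=10)
print(bsz, total, warmup)
```

Under these assumptions the result agrees with the values hard-coded in the script (`32`, `7812`, `469`). The rounding rule is a guess that happens to fit this example and should not be taken as the authors' method.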
diff --git a/examples/roberta/README.finetune_glue.md b/examples/roberta/README.finetune_glue.md
@@ -0,0 +1,66 @@
+# Finetuning RoBERTa on GLUE tasks
+
+### 1) Download the data from the GLUE website (https://gluebenchmark.com/tasks) using the following commands:
+```bash
+wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
+python download_glue_data.py --data_dir glue_data --tasks all
+```
+
+### 2) Preprocess GLUE task data:
+```bash
+./examples/roberta/preprocess_GLUE_tasks.sh glue_data <glue_task_name>
+```
+`glue_task_name` is one of the following:
+`{ALL, QQP, MNLI, QNLI, MRPC, RTE, STS-B, SST-2, CoLA}`
+Use `ALL` for preprocessing all the glue tasks.
+
+### 3) Fine-tuning on GLUE task:
+Example fine-tuning command for the `RTE` task:
+```bash
+TOTAL_NUM_UPDATES=2036  # 10 epochs through RTE for bsz 16
+WARMUP_UPDATES=122      # 6 percent of the number of updates
+LR=2e-05                # Peak LR for polynomial LR scheduler.
+NUM_CLASSES=2
+MAX_SENTENCES=16        # Batch size.
+ROBERTA_PATH=/path/to/roberta/model.pt
+
+CUDA_VISIBLE_DEVICES=0 python train.py RTE-bin/ \
+    --restore-file $ROBERTA_PATH \
+    --max-positions 512 \
+    --max-sentences $MAX_SENTENCES \
+    --max-tokens 4400 \
+    --task sentence_prediction \
+    --reset-optimizer --reset-dataloader --reset-meters \
+    --required-batch-size-multiple 1 \
+    --init-token 0 --separator-token 2 \
+    --arch roberta_large \
+    --criterion sentence_prediction \
+    --num-classes $NUM_CLASSES \
+    --dropout 0.1 --attention-dropout 0.1 \
+    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
+    --clip-norm 0.0 \
+    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
+    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
+    --max-epoch 10 \
+    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric;
+```
+
+For each GLUE task, you will need to use the following command-line arguments:
+
+Argument | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
+---|---|---|---|---|---|---|---|---
+`--num-classes` | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1
+`--lr` | 1e-5 | 1e-5 | 1e-5 | 2e-5 | 1e-5 | 1e-5 | 1e-5 | 2e-5
+`--max-sentences` | 32 | 32 | 32 | 16 | 32 | 16 | 16 | 16
+`--total-num-update` | 123873 | 33112 | 113272 | 2036 | 20935 | 2296 | 5336 | 3598
+`--warmup-updates` | 7432 | 1986 | 28318 | 122 | 1256 | 137 | 320 | 214
+
+For `STS-B` additionally add `--regression-target --best-checkpoint-metric loss` and remove `--maximize-best-checkpoint-metric`.
+
+**Note:**
+
+a) `--total-num-update` is used by the `polynomial_decay` LR scheduler and is calculated for `--max-epoch=10` and `--max-sentences=16/32` depending on the task.
+
+b) The above command-line arguments and hyperparameters were tested on one Nvidia `V100` GPU with `32gb` of memory for each task. Depending on the GPU memory available to you, you can increase `--update-freq` and reduce `--max-sentences`.
+
+c) All the settings in the above table are suggested settings based on our hyperparameter search within a fixed search space (for careful comparison across models). You might be able to find better metrics with a wider hyperparameter search.
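
Note on the new GLUE README: the per-task table and the `STS-B` exception can be captured in a small lookup so the right flags are emitted for each task. This is a minimal sketch. The names `GLUE_SETTINGS` and `glue_flags` are hypothetical and not part of fairseq; the values are copied from the table, and the flag names are those used in the example command.

```python
# Columns: num-classes, lr, max-sentences, total-num-update, warmup-updates
# (values copied from the table in README.finetune_glue.md)
GLUE_SETTINGS = {
    "MNLI":  (3, "1e-5", 32, 123873, 7432),
    "QNLI":  (2, "1e-5", 32, 33112, 1986),
    "QQP":   (2, "1e-5", 32, 113272, 28318),
    "RTE":   (2, "2e-5", 16, 2036, 122),
    "SST-2": (2, "1e-5", 32, 20935, 1256),
    "MRPC":  (2, "1e-5", 16, 2296, 137),
    "CoLA":  (2, "1e-5", 16, 5336, 320),
    "STS-B": (1, "2e-5", 16, 3598, 214),
}


def glue_flags(task):
    """Return the task-specific command-line flags as a list of strings."""
    num_classes, lr, max_sentences, total, warmup = GLUE_SETTINGS[task]
    flags = [
        "--num-classes", str(num_classes),
        "--lr", lr,
        "--max-sentences", str(max_sentences),
        "--total-num-update", str(total),
        "--warmup-updates", str(warmup),
    ]
    if task == "STS-B":
        # Regression task: select checkpoints by loss, without maximizing.
        flags += ["--regression-target", "--best-checkpoint-metric", "loss"]
    else:
        flags += ["--best-checkpoint-metric", "accuracy",
                  "--maximize-best-checkpoint-metric"]
    return flags


print(" ".join(glue_flags("RTE")))
```

The returned list could be appended to the task-independent part of the `train.py` command, for example when launching it through `subprocess`.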