diff --git a/examples/backtranslation/README.md b/examples/backtranslation/README.md
index cb010855cb..a834214adf 100644
--- a/examples/backtranslation/README.md
+++ b/examples/backtranslation/README.md
@@ -4,29 +4,32 @@ This page includes pre-trained models from the paper [Understanding Back-Transla
## Pre-trained models
-Description | Dataset | Model | Test set(s)
+Model | Description | Dataset | Download
---|---|---|---
-Transformer <br> ([Edunov et al., 2018](https://arxiv.org/abs/1808.09381); WMT'18 winner) | [WMT'18 English-German](http://www.statmt.org/wmt18/translation-task.html) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz) | See NOTE in the archive
+`transformer.wmt18.en-de` | Transformer <br> ([Edunov et al., 2018](https://arxiv.org/abs/1808.09381)) <br> WMT'18 winner | [WMT'18 English-German](http://www.statmt.org/wmt18/translation-task.html) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz) <br> See NOTE in the archive
## Example usage
Interactive generation from the full ensemble via PyTorch Hub:
-```
->>> import torch
->>> torch.hub.list('pytorch/fairseq')
-[..., 'transformer.wmt14.en-fr', 'transformer.wmt16.en-de', 'transformer.wmt18.en-de', ... ]
->>> en2de_ensemble = torch.hub.load(
-... 'pytorch/fairseq',
-... 'transformer.wmt18.en-de',
-... checkpoint_file='wmt18.model1.pt:wmt18.model2.pt:wmt18.model3.pt:wmt18.model4.pt:wmt18.model5.pt',
-... data_name_or_path='.',
-... tokenizer='moses',
-... bpe='subword_nmt',
-... )
->>> len(en2de_ensemble.models)
-5
->>> print(en2de_ensemble.generate('Hello world!'))
-Hallo Welt!
+```python
+import torch
+
+# List available models
+torch.hub.list('pytorch/fairseq') # [..., 'transformer.wmt18.en-de', ... ]
+
+# Load the WMT'18 En-De ensemble
+en2de_ensemble = torch.hub.load(
+ 'pytorch/fairseq', 'transformer.wmt18.en-de',
+ checkpoint_file='wmt18.model1.pt:wmt18.model2.pt:wmt18.model3.pt:wmt18.model4.pt:wmt18.model5.pt',
+ tokenizer='moses', bpe='subword_nmt')
+
+# The ensemble contains 5 models
+len(en2de_ensemble.models)
+# 5
+
+# Translate
+en2de_ensemble.translate('Hello world!')
+# 'Hallo Welt!'
```
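If loading the full five-model ensemble is too slow or memory-hungry, the same hub entry can be pointed at a single checkpoint. This is a minimal sketch reusing only the arguments shown above; the translation shown in the comment is illustrative and single-model output may differ slightly from the ensemble.
```python
import torch

# Load just one model from the WMT'18 ensemble (faster, uses less memory)
en2de = torch.hub.load(
    'pytorch/fairseq', 'transformer.wmt18.en-de',
    checkpoint_file='wmt18.model1.pt',
    tokenizer='moses', bpe='subword_nmt')

en2de.translate('Hello world!')
# 'Hallo Welt!'
```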
## Citation
diff --git a/examples/language_model/README.md b/examples/language_model/README.md
index 4b041146e3..180714de49 100644
--- a/examples/language_model/README.md
+++ b/examples/language_model/README.md
@@ -2,36 +2,30 @@
## Pre-trained models
-Description | Parameters | Dataset | Model and Test set(s)
----|---:|---|---
-Adaptive Inputs <br> ([Baevski and Auli, 2018](https://arxiv.org/abs/1809.10853)) | 1026M | [Google Billion Words](https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/lm/adaptive_lm_gbw_huge.tar.bz2)
-Adaptive Inputs <br> ([Baevski and Auli, 2018](https://arxiv.org/abs/1809.10853)) | 247M | [WikiText-103](https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/lm/adaptive_lm_wiki103.tar.bz2)
-
+Model | Description | Dataset | Download
+---|---|---|---
+`transformer_lm.gbw.adaptive_huge` | Adaptive Inputs <br> ([Baevski and Auli, 2018](https://arxiv.org/abs/1809.10853)) <br> 1026M params | [Google Billion Words](https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/lm/adaptive_lm_gbw_huge.tar.bz2)
+`transformer_lm.wiki103.adaptive` | Adaptive Inputs <br> ([Baevski and Auli, 2018](https://arxiv.org/abs/1809.10853)) <br> 247M params | [WikiText-103](https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/lm/adaptive_lm_wiki103.tar.bz2)
+`transformer_lm.wmt19.en` | English LM <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) | [WMT News Crawl](http://data.statmt.org/news-crawl/) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.en.tar.gz)
+`transformer_lm.wmt19.de` | German LM <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) | [WMT News Crawl](http://data.statmt.org/news-crawl/) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.de.tar.gz)
+`transformer_lm.wmt19.ru` | Russian LM <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) | [WMT News Crawl](http://data.statmt.org/news-crawl/) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.ru.tar.gz)
## Example usage
-Interactive generation via PyTorch Hub:
-```
->>> import torch
->>> torch.hub.list('pytorch/fairseq')
-[..., 'transformer_lm.gbw.adaptive_huge', 'transformer_lm.wiki103.adaptive', ...]
->>> lm = torch.hub.load(
-... 'pytorch/fairseq',
-... 'transformer_lm.wiki103.adaptive',
-... data_name_or_path='./data-bin',
-... tokenizer='moses',
-... no_escape=True,
-... beam=1,
-... sampling=True,
-... sampling_topk=10,
-... temperature=0.8,
-... )
->>> lm.generate('Barack Obama', verbose=True)
-```
+Sampling from a language model using PyTorch Hub:
+```python
+import torch
-Available models are listed in the ``hub_models()`` method in each model file, for example:
-[transformer_lm.py](https://github.com/pytorch/fairseq/blob/master/fairseq/models/transformer_lm.py).
+# List available models
+torch.hub.list('pytorch/fairseq') # [..., 'transformer_lm.wmt19.en', ...]
+# Load an English LM trained on WMT'19 News Crawl data
+en_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.en', tokenizer='moses', bpe='fastbpe')
+
+# Sample from the language model
+en_lm.sample('Barack Obama', beam=1, sampling=True, sampling_topk=10, temperature=0.8)
+# "Barack Obama is coming to Sydney and New Zealand (...)"
+```
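The adaptive-inputs models from the table can be loaded through the same interface. The sketch below assumes the `transformer_lm.wiki103.adaptive` hub entry bundles its vocabulary the way the WMT'19 entries do (earlier revisions passed an explicit `data_name_or_path`); the WikiText-103 model operates on words, so no `bpe` argument is passed.
```python
import torch

# Load the adaptive-inputs LM trained on WikiText-103 (word-level vocabulary, no BPE)
wiki_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wiki103.adaptive', tokenizer='moses')

# Sample a continuation with the same settings as above
wiki_lm.sample('Barack Obama', beam=1, sampling=True, sampling_topk=10, temperature=0.8)
```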
## Training a new model with the CLI tools
@@ -44,47 +38,47 @@ Provides an example of pre-processing for [WikiText-103 language modeling task](
Example usage:
Prepare data:
-```
-$ cd examples/language_model/
-$ bash prepare-wikitext-103.sh
-$ cd ../..
+```bash
+cd examples/language_model/
+bash prepare-wikitext-103.sh
+cd ../..
# Binarize the dataset:
-$ TEXT=examples/language_model/wikitext-103
+TEXT=examples/language_model/wikitext-103
-$ fairseq-preprocess --only-source \
- --trainpref $TEXT/wiki.train.tokens --validpref $TEXT/wiki.valid.tokens --testpref $TEXT/wiki.test.tokens \
- --destdir data-bin/wikitext-103
+fairseq-preprocess --only-source \
+ --trainpref $TEXT/wiki.train.tokens --validpref $TEXT/wiki.valid.tokens --testpref $TEXT/wiki.test.tokens \
+ --destdir data-bin/wikitext-103
```
Train a transformer language model with adaptive inputs ([Baevski and Auli (2018): Adaptive Input Representations for Neural Language Modeling](transformer_lm/README.md)):
-```
+```bash
# If it runs out of memory, try to reduce max-tokens and tokens-per-sample
-$ mkdir -p checkpoints/transformer_wikitext-103
-$ fairseq-train --task language_modeling data-bin/wikitext-103 \
- --save-dir checkpoints/transformer_wikitext-103 --arch transformer_lm_wiki103 \
- --max-update 286000 --max-lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
- --warmup-updates 16000 --warmup-init-lr 1e-07 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
- --criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \
- --sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d
+mkdir -p checkpoints/transformer_wikitext-103
+fairseq-train --task language_modeling data-bin/wikitext-103 \
+ --save-dir checkpoints/transformer_wikitext-103 --arch transformer_lm_wiki103 \
+ --max-update 286000 --max-lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
+ --warmup-updates 16000 --warmup-init-lr 1e-07 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
+ --criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \
+ --sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d
# Evaluate:
-$ fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/transformer_wiki103/checkpoint_best.pt' \
- --sample-break-mode complete --max-tokens 3072 --context-window 2560 --softmax-batch 1024
+fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/transformer_wikitext-103/checkpoint_best.pt' \
+ --sample-break-mode complete --max-tokens 3072 --context-window 2560 --softmax-batch 1024
```
Train a convolutional language model ([Dauphin et al. (2017): Language Modeling with Gated Convolutional Networks](conv_lm/README.md)):
```bash
# If it runs out of memory, try to reduce max-tokens and tokens-per-sample
-$ mkdir -p checkpoints/fconv_wikitext-103
-$ fairseq-train --task language_modeling data-bin/wikitext-103 \
- --save-dir checkpoints/fconv_wikitext-103 \
- --max-epoch 35 --arch fconv_lm_dauphin_wikitext103 --optimizer nag \
- --lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
- --clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion adaptive_loss \
- --adaptive-softmax-cutoff 10000,20000,200000 --max-tokens 1024 --tokens-per-sample 1024 \
- --ddp-backend=no_c10d
+mkdir -p checkpoints/fconv_wikitext-103
+fairseq-train --task language_modeling data-bin/wikitext-103 \
+ --save-dir checkpoints/fconv_wikitext-103 \
+ --max-epoch 35 --arch fconv_lm_dauphin_wikitext103 --optimizer nag \
+ --lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
+ --clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion adaptive_loss \
+ --adaptive-softmax-cutoff 10000,20000,200000 --max-tokens 1024 --tokens-per-sample 1024 \
+ --ddp-backend=no_c10d
# Evaluate:
-$ fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/fconv_wiki103/checkpoint_best.pt'
+fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/fconv_wikitext-103/checkpoint_best.pt'
```
diff --git a/examples/roberta/README.finetune_custom_classification.md b/examples/roberta/README.finetune_custom_classification.md
index de3a4cc37a..cd49348f56 100644
--- a/examples/roberta/README.finetune_custom_classification.md
+++ b/examples/roberta/README.finetune_custom_classification.md
@@ -1,14 +1,16 @@
-# RoBERTa fine-tuning on custom classification task (example IMDB)
+# Finetuning RoBERTa on a custom classification task
-## 1) Get the data
-```
+This example shows how to finetune RoBERTa on the IMDB dataset, but should illustrate the process for most classification tasks.
+
+### 1) Get the data
+```bash
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz
```
-## 2) Format data
+### 2) Format data
`IMDB` data has one data sample per file; the Python snippet below merges them into a single file each for train and valid, for easier processing.
-```
+```python
import argparse
import os
import random
@@ -42,79 +44,78 @@ if __name__ == '__main__':
main(args)
```
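A quick way to sanity-check the output of the snippet above (the file names match those used in the later preprocessing steps; the exact label values depend on how you wrote them out):
```bash
# Each split should have one review per line and a matching number of labels
wc -l aclImdb/train.input0 aclImdb/train.label
wc -l aclImdb/dev.input0 aclImdb/dev.label

# Peek at the first example and its label
head -n 1 aclImdb/train.input0
head -n 1 aclImdb/train.label
```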
-## 3) BPE Encode
+### 3) BPE Encode
Run `multiprocessing_bpe_encoder`; you could also do this in the previous step for each sample, but that would be slower.
-```
+```bash
# Download encoder.json and vocab.bpe
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
-for SPLIT in train dev;
-do
- python -m examples.roberta.multiprocessing_bpe_encoder \
- --encoder-json encoder.json \
- --vocab-bpe vocab.bpe \
- --inputs "aclImdb/$SPLIT.input0" \
- --outputs "aclImdb/$SPLIT.input0.bpe" \
- --workers 60 \
- --keep-empty;
+for SPLIT in train dev; do
+ python -m examples.roberta.multiprocessing_bpe_encoder \
+ --encoder-json encoder.json \
+ --vocab-bpe vocab.bpe \
+ --inputs "aclImdb/$SPLIT.input0" \
+ --outputs "aclImdb/$SPLIT.input0.bpe" \
+ --workers 60 \
+ --keep-empty
done
```
+### 4) Preprocess data
-## 4) Preprocess data
-
-```
+```bash
# Download fairseq dictionary.
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'
fairseq-preprocess \
- --only-source \
- --trainpref "aclImdb/train.input0.bpe" \
- --validpref "aclImdb/dev.input0.bpe" \
- --destdir "IMDB-bin/input0" \
- --workers 60 \
- --srcdict dict.txt;
+ --only-source \
+ --trainpref "aclImdb/train.input0.bpe" \
+ --validpref "aclImdb/dev.input0.bpe" \
+ --destdir "IMDB-bin/input0" \
+ --workers 60 \
+ --srcdict dict.txt
fairseq-preprocess \
- --only-source \
- --trainpref "aclImdb/train.label" \
- --validpref "aclImdb/dev.label" \
- --destdir "IMDB-bin/label" \
- --workers 60;
+ --only-source \
+ --trainpref "aclImdb/train.label" \
+ --validpref "aclImdb/dev.label" \
+ --destdir "IMDB-bin/label" \
+ --workers 60
```
-## 5) Run Training
+### 5) Run Training
-```
+```bash
TOTAL_NUM_UPDATES=7812 # 10 epochs through IMDB for bsz 32
WARMUP_UPDATES=469 # 6 percent of the number of updates
LR=1e-05 # Peak LR for polynomial LR scheduler.
NUM_CLASSES=2
MAX_SENTENCES=8 # Batch size.
+ROBERTA_PATH=/path/to/roberta/model.pt
CUDA_VISIBLE_DEVICES=0 python train.py IMDB-bin/ \
---restore-file \
---max-positions 512 \
---max-sentences $MAX_SENTENCES \
---max-tokens 4400 \
---task sentence_prediction \
---reset-optimizer --reset-dataloader --reset-meters \
---required-batch-size-multiple 1 \
---init-token 0 --separator-token 2 \
---arch roberta_large \
---criterion sentence_prediction \
---num-classes $NUM_CLASSES \
---dropout 0.1 --attention-dropout 0.1 \
---weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
---clip-norm 0.0 \
---lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
---fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
---max-epoch 10 \
---best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
---truncate-sequence \
---update-freq 4;
+ --restore-file $ROBERTA_PATH \
+ --max-positions 512 \
+ --max-sentences $MAX_SENTENCES \
+ --max-tokens 4400 \
+ --task sentence_prediction \
+ --reset-optimizer --reset-dataloader --reset-meters \
+ --required-batch-size-multiple 1 \
+ --init-token 0 --separator-token 2 \
+ --arch roberta_large \
+ --criterion sentence_prediction \
+ --num-classes $NUM_CLASSES \
+ --dropout 0.1 --attention-dropout 0.1 \
+ --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
+ --clip-norm 0.0 \
+ --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
+ --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
+ --max-epoch 10 \
+ --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
+ --truncate-sequence \
+ --update-freq 4
```
The above command trains with an effective batch size of `32` (`--max-sentences 8` x `--update-freq 4`), tested on one Nvidia `V100` GPU with `32gb` of memory.
Expected `best-validation-accuracy` after `10` epochs is `~96.5%`.
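After training you can load the best checkpoint and classify new reviews. This is a minimal sketch: it assumes the default classification head name registered by the `sentence_prediction` task (`sentence_classification_head`) and the `checkpoints/` and `IMDB-bin/` paths used above.
```python
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='IMDB-bin'
)
roberta.eval()  # disable dropout

tokens = roberta.encode('This movie was a complete waste of time.')
prediction = roberta.predict('sentence_classification_head', tokens).argmax().item()
print(prediction)  # class index; map it back via the label dictionary in IMDB-bin/label
```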
diff --git a/examples/roberta/README.finetune_glue.md b/examples/roberta/README.finetune_glue.md
new file mode 100644
index 0000000000..c905cab7c0
--- /dev/null
+++ b/examples/roberta/README.finetune_glue.md
@@ -0,0 +1,66 @@
+# Finetuning RoBERTa on GLUE tasks
+
+### 1) Download the data from the GLUE website (https://gluebenchmark.com/tasks) using the following commands:
+```bash
+wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
+python download_glue_data.py --data_dir glue_data --tasks all
+```
+
+### 2) Preprocess GLUE task data:
+```bash
+./examples/roberta/preprocess_GLUE_tasks.sh glue_data <glue_task_name>
+```
+`glue_task_name` is one of the following:
+`{ALL, QQP, MNLI, QNLI, MRPC, RTE, STS-B, SST-2, CoLA}`
+Use `ALL` to preprocess all the GLUE tasks.
+
+### 3) Fine-tuning on a GLUE task:
+Example fine-tuning command for the `RTE` task:
+```bash
+TOTAL_NUM_UPDATES=2036 # 10 epochs through RTE for bsz 16
+WARMUP_UPDATES=122 # 6 percent of the number of updates
+LR=2e-05 # Peak LR for polynomial LR scheduler.
+NUM_CLASSES=2
+MAX_SENTENCES=16 # Batch size.
+ROBERTA_PATH=/path/to/roberta/model.pt
+
+CUDA_VISIBLE_DEVICES=0 python train.py RTE-bin/ \
+ --restore-file $ROBERTA_PATH \
+ --max-positions 512 \
+ --max-sentences $MAX_SENTENCES \
+ --max-tokens 4400 \
+ --task sentence_prediction \
+ --reset-optimizer --reset-dataloader --reset-meters \
+ --required-batch-size-multiple 1 \
+ --init-token 0 --separator-token 2 \
+ --arch roberta_large \
+ --criterion sentence_prediction \
+ --num-classes $NUM_CLASSES \
+ --dropout 0.1 --attention-dropout 0.1 \
+ --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
+ --clip-norm 0.0 \
+ --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
+ --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
+ --max-epoch 10 \
+ --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric;
+```
+
+For each GLUE task, you will need to use the following command-line arguments:
+
+Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
+---|---|---|---|---|---|---|---|---
+`--num-classes` | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1
+`--lr` | 1e-5 | 1e-5 | 1e-5 | 2e-5 | 1e-5 | 1e-5 | 1e-5 | 2e-5
+`--max-sentences` | 32 | 32 | 32 | 16 | 32 | 16 | 16 | 16
+`--total-num-update` | 123873 | 33112 | 113272 | 2036 | 20935 | 2296 | 5336 | 3598
+`--warmup-updates` | 7432 | 1986 | 28318 | 122 | 1256 | 137 | 320 | 214
+
+For `STS-B` additionally add `--regression-target --best-checkpoint-metric loss` and remove `--maximize-best-checkpoint-metric`.
+
+**Note:**
+
+a) `--total-num-update` is used by the `polynomial_decay` scheduler and is calculated for `--max-epoch=10` and `--max-sentences=16/32`, depending on the task.
+
+b) The above command-line arguments and hyperparameters were tested on one Nvidia `V100` GPU with `32gb` of memory for each task. Depending on the GPU memory available to you, you can increase `--update-freq` and reduce `--max-sentences`.
+
+c) All settings in the above table are suggested settings based on our hyperparameter search within a fixed search space (for a careful comparison across models). You might be able to find better metrics with a wider hyperparameter search.
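For example, to adapt the `RTE` command above to `MRPC`, only the task-specific variables change (a sketch: values are taken from the table, and `MRPC-bin/` is assumed to be the directory written by the preprocessing step):
```bash
TOTAL_NUM_UPDATES=2296   # from the table above
WARMUP_UPDATES=137
LR=1e-05
NUM_CLASSES=2
MAX_SENTENCES=16
ROBERTA_PATH=/path/to/roberta/model.pt

# ...then run the same `python train.py` command as above, replacing RTE-bin/ with MRPC-bin/
```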
diff --git a/examples/roberta/README.md b/examples/roberta/README.md
index 989c9d750e..e975789f01 100644
--- a/examples/roberta/README.md
+++ b/examples/roberta/README.md
@@ -39,85 +39,83 @@ Model | Accuracy | Middle | High
## Example usage
##### Load RoBERTa from torch.hub (PyTorch >= 1.1):
-```
->>> import torch
->>> roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
->>> roberta.eval() # disable dropout (or leave in train mode to finetune)
+```python
+import torch
+roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
+roberta.eval() # disable dropout (or leave in train mode to finetune)
```
##### Load RoBERTa (for PyTorch 1.0):
-```
-$ wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
-$ tar -xzvf roberta.large.tar.gz
+```python
+# Download roberta.large model
+wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
+tar -xzvf roberta.large.tar.gz
->>> from fairseq.models.roberta import RobertaModel
->>> roberta = RobertaModel.from_pretrained('/path/to/roberta.large')
->>> roberta.eval() # disable dropout (or leave in train mode to finetune)
+# Load the model in fairseq
+from fairseq.models.roberta import RobertaModel
+roberta = RobertaModel.from_pretrained('/path/to/roberta.large')
+roberta.eval() # disable dropout (or leave in train mode to finetune)
```
##### Apply Byte-Pair Encoding (BPE) to input text:
-```
->>> tokens = roberta.encode('Hello world!')
->>> tokens
-tensor([ 0, 31414, 232, 328, 2])
->>> roberta.decode(tokens)
-'Hello world!'
+```python
+tokens = roberta.encode('Hello world!')
+assert tokens.tolist() == [0, 31414, 232, 328, 2]
+roberta.decode(tokens) # 'Hello world!'
```
##### Extract features from RoBERTa:
-```
->>> last_layer_features = roberta.extract_features(tokens)
->>> last_layer_features.size()
-torch.Size([1, 5, 1024])
+```python
+# Extract the last layer's features
+last_layer_features = roberta.extract_features(tokens)
+assert last_layer_features.size() == torch.Size([1, 5, 1024])
->>> all_layers = roberta.extract_features(tokens, return_all_hiddens=True)
->>> len(all_layers)
-25
-
->>> torch.all(all_layers[-1] == last_layer_features)
-tensor(1, dtype=torch.uint8)
+# Extract all layers' features (layer 0 is the embedding layer)
+all_layers = roberta.extract_features(tokens, return_all_hiddens=True)
+assert len(all_layers) == 25
+assert torch.all(all_layers[-1] == last_layer_features)
```
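If you need a single fixed-size vector per sentence, one simple heuristic (not something prescribed by the paper) is to mean-pool the last layer's features:
```python
# Average the token representations to get one 1024-d vector per sentence
sentence_embedding = last_layer_features.mean(dim=1)
assert sentence_embedding.size() == torch.Size([1, 1024])
```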
##### Use RoBERTa for sentence-pair classification tasks:
-```
->>> roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli') # already finetuned
->>> roberta.eval() # disable dropout for evaluation
-
->>> tokens = roberta.encode(
-... 'Roberta is a heavily optimized version of BERT.',
-... 'Roberta is not very optimized.'
-... )
-
->>> roberta.predict('mnli', tokens).argmax()
-tensor(0) # contradiction
+```python
+# Download RoBERTa already finetuned for MNLI
+roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
+roberta.eval() # disable dropout for evaluation
->>> tokens = roberta.encode(
-... 'Roberta is a heavily optimized version of BERT.',
-... 'Roberta is based on BERT.'
-... )
+# Encode a pair of sentences and make a prediction
+tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is not very optimized.')
+roberta.predict('mnli', tokens).argmax() # 0: contradiction
->>> roberta.predict('mnli', tokens).argmax()
-tensor(2) # entailment
+# Encode another pair of sentences
+tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is based on BERT.')
+roberta.predict('mnli', tokens).argmax() # 2: entailment
```
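`predict()` returns log-probabilities over the three MNLI classes; `argmax()` above just picks the most likely one. The class order matches the `label_map` used in the evaluation snippet further below:
```python
# Log-probabilities over (0: contradiction, 1: neutral, 2: entailment)
logprobs = roberta.predict('mnli', tokens)
assert logprobs.size() == torch.Size([1, 3])
```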
##### Register a new (randomly initialized) classification head:
+```python
+roberta.register_classification_head('new_task', num_classes=3)
+logprobs = roberta.predict('new_task', tokens)  # tensor([[-1.1050, -1.0672, -1.1245]], grad_fn=<LogSoftmaxBackward>)
```
->>> roberta.register_classification_head('new_task', num_classes=3)
->>> roberta.predict('new_task', tokens)
-tensor([[-1.1050, -1.0672, -1.1245]], grad_fn=<LogSoftmaxBackward>)
+
+##### Batched prediction:
+```python
+from fairseq.data.data_utils import collate_tokens
+sentences = ['Hello world.', 'Another unrelated sentence.']
+batch = collate_tokens([roberta.encode(sent) for sent in sentences], pad_idx=1)
+logprobs = roberta.predict('new_task', batch)
+assert logprobs.size() == torch.Size([2, 3])
```
##### Using the GPU:
-```
->>> roberta.cuda()
->>> roberta.predict('new_task', tokens)
-tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
+```python
+roberta.cuda()
+roberta.predict('new_task', tokens)  # tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
```
##### Evaluating the `roberta.large.mnli` model
Example python code snippet to evaluate accuracy on the MNLI dev_matched set.
-```
+```python
label_map = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}
ncorrect, nsamples = 0, 0
roberta.cuda()
@@ -137,79 +135,11 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples))
```
-## Finetuning on GLUE tasks
-
-##### 1) Download the data from GLUE website (https://gluebenchmark.com/tasks) using following commands:
-```
-$ wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
-$ python download_glue_data.py --data_dir glue_data --tasks all
-```
-
-##### 2) Preprocess GLUE task data:
-```
-$ ./examples/roberta/preprocess_GLUE_tasks.sh glue_data <glue_task_name>
-```
-`glue_task_name` is one of the following:
-`{ALL, QQP, MNLI, QNLI, MRPC, RTE, STS-B, SST-2, CoLA}`
-Use `ALL` for preprocessing all the glue tasks.
-
-##### 3) Fine-tuning on GLUE task :
-Example fine-tuning cmd for `RTE` task
-```
-TOTAL_NUM_UPDATES=2036 # 10 epochs through RTE for bsz 16
-WARMUP_UPDATES=122 # 6 percent of the number of updates
-LR=2e-05 # Peak LR for polynomial LR scheduler.
-NUM_CLASSES=2
-MAX_SENTENCES=16 # Batch size.
-
-CUDA_VISIBLE_DEVICES=0 python train.py RTE-bin/ \
---restore-file \
---max-positions 512 \
---max-sentences $MAX_SENTENCES \
---max-tokens 4400 \
---task sentence_prediction \
---reset-optimizer --reset-dataloader --reset-meters \
---required-batch-size-multiple 1 \
---init-token 0 --separator-token 2 \
---arch roberta_large \
---criterion sentence_prediction \
---num-classes $NUM_CLASSES \
---dropout 0.1 --attention-dropout 0.1 \
---weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
---clip-norm 0.0 \
---lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
---fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
---max-epoch 10 \
---best-checkpoint-metric accuracy --maximize-best-checkpoint-metric;
-```
-
-For each of the GLUE task, you will need to use following cmd-line arguments:
-
-Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
----|---|---|---|---|---|---|---|---
-`--num-classes` | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1
-`--lr` | 1e-5 | 1e-5 | 1e-5 | 2e-5 | 1e-5 | 1e-5 | 1e-5 | 2e-5
-`--max-sentences` | 32 | 32 | 32 | 16 | 32 | 16 | 16 | 16
-`--total-num-update` | 123873 | 33112 | 113272 | 2036 | 20935 | 2296 | 5336 | 3598
-`--warmup-updates` | 7432 | 1986 | 28318 | 122 | 1256 | 137 | 320 | 214
-
-For `STS-B` additionally use following cmd-line argument:
-```
---regression-target
---best-checkpoint-metric loss
-```
-and remove `--maximize-best-checkpoint-metric`.
-
-**Note:**
-
-a) `--total-num-updates` is used by `--polynomial_decay` scheduler and is calculated for `--max-epoch=10` and `--max-sentences=16/32` depending on the task.
-
-b) Above cmd-args and hyperparams are tested on one Nvidia `V100` GPU with `32gb` of memory for each task. Depending on the GPU memory resources available to you, you can use increase `--update-freq` and reduce `--max-sentences`.
-
-c) All the settings in above table are suggested settings based on our hyperparam search within a fixed search space (for careful comparison across models). You might be able to find better metrics with wider hyperparam search.
+## Finetuning
-## Fine-tuning on custom classification tasks
-[Example of fine-tuning Roberta on simple custom classification task](README.finetune_custom_classification.md)
+- [Finetuning on GLUE](README.finetune_glue.md)
+- [Finetuning on custom classification tasks (e.g., IMDB)](README.finetune_custom_classification.md)
+- Finetuning on SQuAD: coming soon
## Pretraining using your own data
@@ -223,11 +153,11 @@ A more detailed tutorial is coming soon.
```bibtex
@article{liu2019roberta,
- title = {RoBERTa: A Robustly Optimized BERT Pretraining Approach},
- author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and
- Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and
- Luke Zettlemoyer and Veselin Stoyanov},
- journal={arXiv preprint arXiv:1907.11692},
- year = {2019},
+ title = {RoBERTa: A Robustly Optimized BERT Pretraining Approach},
+ author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and
+ Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and
+ Luke Zettlemoyer and Veselin Stoyanov},
+ journal={arXiv preprint arXiv:1907.11692},
+ year = {2019},
}
```
diff --git a/examples/scaling_nmt/README.md b/examples/scaling_nmt/README.md
index d31aa3ae9e..d814436a46 100644
--- a/examples/scaling_nmt/README.md
+++ b/examples/scaling_nmt/README.md
@@ -4,10 +4,10 @@ This page includes instructions for reproducing results from the paper [Scaling
## Pre-trained models
-Description | Dataset | Model | Test set(s)
+Model | Description | Dataset | Download
---|---|---|---
-Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-fr.joined-dict.transformer.tar.bz2) | newstest2014 (shared vocab): <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-fr.joined-dict.newstest2014.tar.bz2)
-Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT16 English-German](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt16.en-de.joined-dict.transformer.tar.bz2) | newstest2014 (shared vocab): <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2)
+`transformer.wmt14.en-fr` | Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-fr.joined-dict.transformer.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-fr.joined-dict.newstest2014.tar.bz2)
+`transformer.wmt16.en-de` | Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT16 English-German](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt16.en-de.joined-dict.transformer.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2)
## Training a new model on WMT'16 En-De
@@ -15,33 +15,33 @@ Please first download the [preprocessed WMT'16 En-De data provided by Google](ht
Then:
1. Extract the WMT'16 En-De data:
-```
-$ TEXT=wmt16_en_de_bpe32k
-$ mkdir $TEXT
-$ tar -xzvf wmt16_en_de.tar.gz -C $TEXT
+```bash
+TEXT=wmt16_en_de_bpe32k
+mkdir $TEXT
+tar -xzvf wmt16_en_de.tar.gz -C $TEXT
```
2. Preprocess the dataset with a joined dictionary:
-```
-$ fairseq-preprocess --source-lang en --target-lang de \
- --trainpref $TEXT/train.tok.clean.bpe.32000 \
- --validpref $TEXT/newstest2013.tok.bpe.32000 \
- --testpref $TEXT/newstest2014.tok.bpe.32000 \
- --destdir data-bin/wmt16_en_de_bpe32k \
- --nwordssrc 32768 --nwordstgt 32768 \
- --joined-dictionary
+```bash
+fairseq-preprocess --source-lang en --target-lang de \
+ --trainpref $TEXT/train.tok.clean.bpe.32000 \
+ --validpref $TEXT/newstest2013.tok.bpe.32000 \
+ --testpref $TEXT/newstest2014.tok.bpe.32000 \
+ --destdir data-bin/wmt16_en_de_bpe32k \
+ --nwordssrc 32768 --nwordstgt 32768 \
+ --joined-dictionary
```
3. Train a model:
-```
-$ fairseq-train data-bin/wmt16_en_de_bpe32k \
- --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
- --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
- --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
- --lr 0.0005 --min-lr 1e-09 \
- --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
- --max-tokens 3584 \
- --fp16
+```bash
+fairseq-train data-bin/wmt16_en_de_bpe32k \
+ --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
+ --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
+ --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
+ --lr 0.0005 --min-lr 1e-09 \
+ --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
+ --max-tokens 3584 \
+ --fp16
```
Note that the `--fp16` flag requires you have CUDA 9.1 or greater and a Volta GPU.
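Once training converges you can translate the binarized newstest2014 split with the generation CLI used elsewhere in these examples. A sketch, assuming checkpoints were written to the default `checkpoints/` directory; beam 4 and length penalty 0.6 are the commonly used evaluation settings for this benchmark.
```bash
fairseq-generate data-bin/wmt16_en_de_bpe32k \
    --path checkpoints/checkpoint_best.pt \
    --beam 4 --lenpen 0.6 --remove-bpe
```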
diff --git a/examples/stories/README.md b/examples/stories/README.md
index 29653054f8..625439e81a 100644
--- a/examples/stories/README.md
+++ b/examples/stories/README.md
@@ -14,7 +14,7 @@ We provide sample stories generated by the [convolutional seq2seq model](https:/
The dataset can be downloaded like this:
-```
+```bash
cd examples/stories
curl https://dl.fbaipublicfiles.com/fairseq/data/writingPrompts.tar.gz | tar xvzf -
```
@@ -23,28 +23,28 @@ and contains a train, test, and valid split. The dataset is described here: http
## Example usage
+First we will preprocess the dataset. Note that the dataset release contains the full stories, but the paper models only the first 1000 words of each story. Here is example code that trims each story to its first 1000 words:
+```python
+data = ["train", "test", "valid"]
+for name in data:
+ with open(name + ".wp_target") as f:
+ stories = f.readlines()
+ stories = [" ".join(i.split()[0:1000]) for i in stories]
+ with open(name + ".wp_target", "w") as o:
+ for line in stories:
+ o.write(line.strip() + "\n")
```
-# Preprocess the dataset:
-# Note that the dataset release is the full data, but the paper models the first 1000 words of each story
-# Here is some example code that can trim the dataset to the first 1000 words of each story
-$ python
-$ data = ["train", "test", "valid"]
-$ for name in data:
-$ with open(name + ".wp_target") as f:
-$ stories = f.readlines()
-$ stories = [" ".join(i.split()[0:1000]) for i in stories]
-$ with open(name + ".wp_target", "w") as o:
-$ for line in stories:
-$ o.write(line.strip() + "\n")
+Once we've trimmed the data we can binarize it and train our model:
+```bash
# Binarize the dataset:
-$ export TEXT=examples/stories/writingPrompts
-$ fairseq-preprocess --source-lang wp_source --target-lang wp_target \
- --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
- --destdir data-bin/writingPrompts --padding-factor 1 --thresholdtgt 10 --thresholdsrc 10
+export TEXT=examples/stories/writingPrompts
+fairseq-preprocess --source-lang wp_source --target-lang wp_target \
+ --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
+ --destdir data-bin/writingPrompts --padding-factor 1 --thresholdtgt 10 --thresholdsrc 10
# Train the model:
-$ fairseq-train data-bin/writingPrompts -a fconv_self_att_wp --lr 0.25 --clip-norm 0.1 --max-tokens 1500 --lr-scheduler reduce_lr_on_plateau --decoder-attention True --encoder-attention False --criterion label_smoothed_cross_entropy --weight-decay .0000001 --label-smoothing 0 --source-lang wp_source --target-lang wp_target --gated-attention True --self-attention True --project-input True --pretrained False
+fairseq-train data-bin/writingPrompts -a fconv_self_att_wp --lr 0.25 --clip-norm 0.1 --max-tokens 1500 --lr-scheduler reduce_lr_on_plateau --decoder-attention True --encoder-attention False --criterion label_smoothed_cross_entropy --weight-decay .0000001 --label-smoothing 0 --source-lang wp_source --target-lang wp_target --gated-attention True --self-attention True --project-input True --pretrained False
# Train a fusion model:
# add the arguments: --pretrained True --pretrained-checkpoint path/to/checkpoint
@@ -52,7 +52,7 @@ $ fairseq-train data-bin/writingPrompts -a fconv_self_att_wp --lr 0.25 --clip-no
# Generate:
# Note: to load the pretrained model at generation time, you need to pass in a model-override argument to communicate to the fusion model at generation time where you have placed the pretrained checkpoint. By default, it will load the exact path of the fusion model's pretrained model from training time. You should use model-override if you have moved the pretrained model (or are using our provided models). If you are generating from a non-fusion model, the model-override argument is not necessary.
-$ fairseq-generate data-bin/writingPrompts --path /path/to/trained/model/checkpoint_best.pt --batch-size 32 --beam 1 --sampling --sampling-topk 10 --sampling-temperature 0.8 --nbest 1 --model-overrides "{'pretrained_checkpoint':'/path/to/pretrained/model/checkpoint'}"
+fairseq-generate data-bin/writingPrompts --path /path/to/trained/model/checkpoint_best.pt --batch-size 32 --beam 1 --sampling --sampling-topk 10 --sampling-temperature 0.8 --nbest 1 --model-overrides "{'pretrained_checkpoint':'/path/to/pretrained/model/checkpoint'}"
```
## Citation
diff --git a/examples/translation/README.md b/examples/translation/README.md
index 72f8b16178..a43f0af1ad 100644
--- a/examples/translation/README.md
+++ b/examples/translation/README.md
@@ -2,57 +2,58 @@
## Pre-trained models
-Description | Dataset | Model | Test set(s)
+Model | Description | Dataset | Download
---|---|---|---
-Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2) | newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.newstest2014.tar.bz2) <br> newstest2012/2013: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.ntst1213.tar.bz2)
-Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-de.fconv-py.tar.bz2) | newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-de.newstest2014.tar.bz2)
-Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT17 English-German](http://statmt.org/wmt17/translation-task.html#Download) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt17.v2.en-de.fconv-py.tar.bz2) | newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt17.v2.en-de.newstest2014.tar.bz2)
-Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-fr.joined-dict.transformer.tar.bz2) | newstest2014 (shared vocab): <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-fr.joined-dict.newstest2014.tar.bz2)
-Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT16 English-German](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt16.en-de.joined-dict.transformer.tar.bz2) | newstest2014 (shared vocab): <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2)
-Transformer <br> ([Edunov et al., 2018](https://arxiv.org/abs/1808.09381); WMT'18 winner) | [WMT'18 English-German](http://www.statmt.org/wmt18/translation-task.html) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz) | See NOTE in the archive
+`conv.wmt14.en-fr` | Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.newstest2014.tar.bz2) <br> newstest2012/2013: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.ntst1213.tar.bz2)
+`conv.wmt14.en-de` | Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-de.fconv-py.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-de.newstest2014.tar.bz2)
+`conv.wmt17.en-de` | Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT17 English-German](http://statmt.org/wmt17/translation-task.html#Download) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt17.v2.en-de.fconv-py.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt17.v2.en-de.newstest2014.tar.bz2)
+`transformer.wmt14.en-fr` | Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-fr.joined-dict.transformer.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-fr.joined-dict.newstest2014.tar.bz2)
+`transformer.wmt16.en-de` | Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT16 English-German](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt16.en-de.joined-dict.transformer.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2)
+`transformer.wmt18.en-de` | Transformer <br> ([Edunov et al., 2018](https://arxiv.org/abs/1808.09381)) <br> WMT'18 winner | [WMT'18 English-German](http://www.statmt.org/wmt18/translation-task.html) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz) <br> See NOTE in the archive
+`transformer.wmt19.en-de` | Transformer <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) <br> WMT'19 winner | [WMT'19 English-German](http://www.statmt.org/wmt19/translation-task.html) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-de.joined-dict.ensemble.tar.gz)
+`transformer.wmt19.de-en` | Transformer <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) <br> WMT'19 winner | [WMT'19 German-English](http://www.statmt.org/wmt19/translation-task.html) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.de-en.joined-dict.ensemble.tar.gz)
+`transformer.wmt19.en-ru` | Transformer <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) <br> WMT'19 winner | [WMT'19 English-Russian](http://www.statmt.org/wmt19/translation-task.html) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-ru.ensemble.tar.gz)
+`transformer.wmt19.ru-en` | Transformer <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) <br> WMT'19 winner | [WMT'19 Russian-English](http://www.statmt.org/wmt19/translation-task.html) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.ru-en.ensemble.tar.gz)
## Example usage (torch.hub)
-Interactive generation via PyTorch Hub:
-```
->>> import torch
->>> torch.hub.list('pytorch/fairseq')
-[..., 'transformer.wmt14.en-fr', 'transformer.wmt16.en-de', 'transformer.wmt18.en-de', ... ]
->>> en2de = torch.hub.load(
-... 'pytorch/fairseq',
-... 'transformer.wmt16.en-de',
-... data_name_or_path='.',
-... tokenizer='moses',
-... bpe='subword_nmt',
-... )
->>> print(en2de.models[0].__class__)
-<class 'fairseq.models.transformer.TransformerModel'>
->>> print(en2de.generate('Hello world!'))
-Hallo Welt!
-```
+Interactive translation via PyTorch Hub:
+```python
+import torch
+
+# List available models
+torch.hub.list('pytorch/fairseq') # [..., 'transformer.wmt16.en-de', ... ]
+
+# Load a transformer trained on WMT'16 En-De
+en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt16.en-de', tokenizer='moses', bpe='subword_nmt')
-Available models are listed in the ``hub_models()`` method in each model file, for example:
-[transformer.py](https://github.com/pytorch/fairseq/blob/master/fairseq/models/transformer.py).
+# The underlying model is available under the *models* attribute
+from fairseq.models.transformer import TransformerModel
+assert isinstance(en2de.models[0], TransformerModel)
+
+# Translate a sentence
+en2de.translate('Hello world!')
+# 'Hallo Welt!'
+```
## Example usage (CLI tools)
Generation with the binarized test sets can be run in batch mode as follows, e.g. for WMT 2014 English-French on a GTX-1080ti:
-```
-$ mkdir -p data-bin
-$ curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf - -C data-bin
-$ curl https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.newstest2014.tar.bz2 | tar xvjf - -C data-bin
-$ fairseq-generate data-bin/wmt14.en-fr.newstest2014 \
- --path data-bin/wmt14.en-fr.fconv-py/model.pt \
- --beam 5 --batch-size 128 --remove-bpe | tee /tmp/gen.out
-...
-| Translated 3003 sentences (96311 tokens) in 166.0s (580.04 tokens/s)
-| Generate test with beam=5: BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
+```bash
+mkdir -p data-bin
+curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf - -C data-bin
+curl https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.newstest2014.tar.bz2 | tar xvjf - -C data-bin
+fairseq-generate data-bin/wmt14.en-fr.newstest2014 \
+ --path data-bin/wmt14.en-fr.fconv-py/model.pt \
+ --beam 5 --batch-size 128 --remove-bpe | tee /tmp/gen.out
+# ...
+# | Translated 3003 sentences (96311 tokens) in 166.0s (580.04 tokens/s)
+# | Generate test with beam=5: BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
# Compute BLEU score
-$ grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.out.sys
-$ grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.out.ref
-$ fairseq-score --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref
-BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
+grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.out.sys
+grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.out.ref
+fairseq-score --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref
+# BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
```
## Preprocessing
@@ -64,55 +65,54 @@ These scripts provide an example of pre-processing data for the NMT task.
Provides an example of pre-processing for IWSLT'14 German to English translation task: ["Report on the 11th IWSLT evaluation campaign" by Cettolo et al.](http://workshop2014.iwslt.org/downloads/proceeding.pdf)
Example usage:
-```
-$ cd examples/translation/
-$ bash prepare-iwslt14.sh
-$ cd ../..
+```bash
+cd examples/translation/
+bash prepare-iwslt14.sh
+cd ../..
# Binarize the dataset:
-$ TEXT=examples/translation/iwslt14.tokenized.de-en
-$ fairseq-preprocess --source-lang de --target-lang en \
- --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
- --destdir data-bin/iwslt14.tokenized.de-en
+TEXT=examples/translation/iwslt14.tokenized.de-en
+fairseq-preprocess --source-lang de --target-lang en \
+ --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
+ --destdir data-bin/iwslt14.tokenized.de-en
# Train the model (better for a single GPU setup):
-$ mkdir -p checkpoints/fconv
-$ CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
- --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
- --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
- --lr-scheduler fixed --force-anneal 200 \
- --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
+mkdir -p checkpoints/fconv
+CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
+ --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
+ --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
+ --lr-scheduler fixed --force-anneal 200 \
+ --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
# Generate:
-$ fairseq-generate data-bin/iwslt14.tokenized.de-en \
- --path checkpoints/fconv/checkpoint_best.pt \
- --batch-size 128 --beam 5 --remove-bpe
+fairseq-generate data-bin/iwslt14.tokenized.de-en \
+ --path checkpoints/fconv/checkpoint_best.pt \
+ --batch-size 128 --beam 5 --remove-bpe
```
To train a transformer model on IWSLT'14 German to English:
-```
+```bash
# Preparation steps are the same as for fconv model.
# Train the model (better for a single GPU setup):
-$ mkdir -p checkpoints/transformer
-$ CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
- -a transformer_iwslt_de_en --optimizer adam --lr 0.0005 -s de -t en \
- --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 \
- --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
- --criterion label_smoothed_cross_entropy --max-update 50000 \
- --warmup-updates 4000 --warmup-init-lr '1e-07' \
- --adam-betas '(0.9, 0.98)' --save-dir checkpoints/transformer
+mkdir -p checkpoints/transformer
+CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
+ -a transformer_iwslt_de_en --optimizer adam --lr 0.0005 -s de -t en \
+ --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 \
+ --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
+ --criterion label_smoothed_cross_entropy --max-update 50000 \
+ --warmup-updates 4000 --warmup-init-lr '1e-07' \
+ --adam-betas '(0.9, 0.98)' --save-dir checkpoints/transformer
# Average 10 latest checkpoints:
-$ python scripts/average_checkpoints.py --inputs checkpoints/transformer \
- --num-epoch-checkpoints 10 --output checkpoints/transformer/model.pt
+python scripts/average_checkpoints.py --inputs checkpoints/transformer \
+ --num-epoch-checkpoints 10 --output checkpoints/transformer/model.pt
# Generate:
-$ fairseq-generate data-bin/iwslt14.tokenized.de-en \
- --path checkpoints/transformer/model.pt \
- --batch-size 128 --beam 5 --remove-bpe
-
+fairseq-generate data-bin/iwslt14.tokenized.de-en \
+ --path checkpoints/transformer/model.pt \
+ --batch-size 128 --beam 5 --remove-bpe
```
### prepare-wmt14en2de.sh
@@ -122,36 +122,35 @@ By default it will produce a dataset that was modeled after ["Attention Is All Y
To use only data available in WMT'14 or to replicate results obtained in the original ["Convolutional Sequence to Sequence Learning" (Gehring et al., 2017)](https://arxiv.org/abs/1705.03122) paper, please use the `--icml17` option.
-```
-$ bash prepare-wmt14en2de.sh --icml17
+```bash
+bash prepare-wmt14en2de.sh --icml17
```
Example usage:
-```
-$ cd examples/translation/
-$ bash prepare-wmt14en2de.sh
-$ cd ../..
+```bash
+cd examples/translation/
+bash prepare-wmt14en2de.sh
+cd ../..
# Binarize the dataset:
-$ TEXT=examples/translation/wmt17_en_de
-$ fairseq-preprocess --source-lang en --target-lang de \
- --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
- --destdir data-bin/wmt17_en_de --thresholdtgt 0 --thresholdsrc 0
+TEXT=examples/translation/wmt17_en_de
+fairseq-preprocess --source-lang en --target-lang de \
+ --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
+ --destdir data-bin/wmt17_en_de --thresholdtgt 0 --thresholdsrc 0
# Train the model:
# If it runs out of memory, try to set --max-tokens 1500 instead
-$ mkdir -p checkpoints/fconv_wmt_en_de
-$ fairseq-train data-bin/wmt17_en_de \
- --lr 0.5 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
- --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
- --lr-scheduler fixed --force-anneal 50 \
- --arch fconv_wmt_en_de --save-dir checkpoints/fconv_wmt_en_de
+mkdir -p checkpoints/fconv_wmt_en_de
+fairseq-train data-bin/wmt17_en_de \
+ --lr 0.5 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
+ --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
+ --lr-scheduler fixed --force-anneal 50 \
+ --arch fconv_wmt_en_de --save-dir checkpoints/fconv_wmt_en_de
# Generate:
-$ fairseq-generate data-bin/wmt17_en_de \
- --path checkpoints/fconv_wmt_en_de/checkpoint_best.pt --beam 5 --remove-bpe
-
+fairseq-generate data-bin/wmt17_en_de \
+ --path checkpoints/fconv_wmt_en_de/checkpoint_best.pt --beam 5 --remove-bpe
```
### prepare-wmt14en2fr.sh
@@ -160,30 +159,29 @@ Provides an example of pre-processing for the WMT'14 English to French translati
Example usage:
-```
-$ cd examples/translation/
-$ bash prepare-wmt14en2fr.sh
-$ cd ../..
+```bash
+cd examples/translation/
+bash prepare-wmt14en2fr.sh
+cd ../..
# Binarize the dataset:
-$ TEXT=examples/translation/wmt14_en_fr
-$ fairseq-preprocess --source-lang en --target-lang fr \
- --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
- --destdir data-bin/wmt14_en_fr --thresholdtgt 0 --thresholdsrc 0
+TEXT=examples/translation/wmt14_en_fr
+fairseq-preprocess --source-lang en --target-lang fr \
+ --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
+ --destdir data-bin/wmt14_en_fr --thresholdtgt 0 --thresholdsrc 0
# Train the model:
# If it runs out of memory, try to set --max-tokens 1000 instead
-$ mkdir -p checkpoints/fconv_wmt_en_fr
-$ fairseq-train data-bin/wmt14_en_fr \
- --lr 0.5 --clip-norm 0.1 --dropout 0.1 --max-tokens 3000 \
- --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
- --lr-scheduler fixed --force-anneal 50 \
- --arch fconv_wmt_en_fr --save-dir checkpoints/fconv_wmt_en_fr
+mkdir -p checkpoints/fconv_wmt_en_fr
+fairseq-train data-bin/wmt14_en_fr \
+ --lr 0.5 --clip-norm 0.1 --dropout 0.1 --max-tokens 3000 \
+ --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
+ --lr-scheduler fixed --force-anneal 50 \
+ --arch fconv_wmt_en_fr --save-dir checkpoints/fconv_wmt_en_fr
# Generate:
-$ fairseq-generate data-bin/fconv_wmt_en_fr \
- --path checkpoints/fconv_wmt_en_fr/checkpoint_best.pt --beam 5 --remove-bpe
-
+fairseq-generate data-bin/fconv_wmt_en_fr \
+ --path checkpoints/fconv_wmt_en_fr/checkpoint_best.pt --beam 5 --remove-bpe
```
## Multilingual Translation
@@ -195,64 +193,64 @@ Note that we use slightly different preprocessing here than for the IWSLT'14
En-De data above. In particular we learn a joint BPE code for all three
languages and use interactive.py and sacrebleu for scoring the test set.
-```
+```bash
# First install sacrebleu and sentencepiece
-$ pip install sacrebleu sentencepiece
+pip install sacrebleu sentencepiece
# Then download and preprocess the data
-$ cd examples/translation/
-$ bash prepare-iwslt17-multilingual.sh
-$ cd ../..
+cd examples/translation/
+bash prepare-iwslt17-multilingual.sh
+cd ../..
# Binarize the de-en dataset
-$ TEXT=examples/translation/iwslt17.de_fr.en.bpe16k
-$ fairseq-preprocess --source-lang de --target-lang en \
- --trainpref $TEXT/train.bpe.de-en --validpref $TEXT/valid.bpe.de-en \
- --joined-dictionary \
- --destdir data-bin/iwslt17.de_fr.en.bpe16k \
- --workers 10
+TEXT=examples/translation/iwslt17.de_fr.en.bpe16k
+fairseq-preprocess --source-lang de --target-lang en \
+ --trainpref $TEXT/train.bpe.de-en --validpref $TEXT/valid.bpe.de-en \
+ --joined-dictionary \
+ --destdir data-bin/iwslt17.de_fr.en.bpe16k \
+ --workers 10
# Binarize the fr-en dataset
# NOTE: it's important to reuse the en dictionary from the previous step
-$ fairseq-preprocess --source-lang fr --target-lang en \
- --trainpref $TEXT/train.bpe.fr-en --validpref $TEXT/valid.bpe.fr-en \
- --joined-dictionary --tgtdict data-bin/iwslt17.de_fr.en.bpe16k/dict.en.txt \
- --destdir data-bin/iwslt17.de_fr.en.bpe16k \
- --workers 10
+fairseq-preprocess --source-lang fr --target-lang en \
+ --trainpref $TEXT/train.bpe.fr-en --validpref $TEXT/valid.bpe.fr-en \
+ --joined-dictionary --tgtdict data-bin/iwslt17.de_fr.en.bpe16k/dict.en.txt \
+ --destdir data-bin/iwslt17.de_fr.en.bpe16k \
+ --workers 10
# Train a multilingual transformer model
# NOTE: the command below assumes 1 GPU, but accumulates gradients from
# 8 fwd/bwd passes to simulate training on 8 GPUs
-$ mkdir -p checkpoints/multilingual_transformer
-$ CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt17.de_fr.en.bpe16k/ \
- --max-epoch 50 \
- --ddp-backend=no_c10d \
- --task multilingual_translation --lang-pairs de-en,fr-en \
- --arch multilingual_transformer_iwslt_de_en \
- --share-decoders --share-decoder-input-output-embed \
- --optimizer adam --adam-betas '(0.9, 0.98)' \
- --lr 0.0005 --lr-scheduler inverse_sqrt --min-lr '1e-09' \
- --warmup-updates 4000 --warmup-init-lr '1e-07' \
- --label-smoothing 0.1 --criterion label_smoothed_cross_entropy \
- --dropout 0.3 --weight-decay 0.0001 \
- --save-dir checkpoints/multilingual_transformer \
- --max-tokens 4000 \
- --update-freq 8
+mkdir -p checkpoints/multilingual_transformer
+CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt17.de_fr.en.bpe16k/ \
+ --max-epoch 50 \
+ --ddp-backend=no_c10d \
+ --task multilingual_translation --lang-pairs de-en,fr-en \
+ --arch multilingual_transformer_iwslt_de_en \
+ --share-decoders --share-decoder-input-output-embed \
+ --optimizer adam --adam-betas '(0.9, 0.98)' \
+ --lr 0.0005 --lr-scheduler inverse_sqrt --min-lr '1e-09' \
+ --warmup-updates 4000 --warmup-init-lr '1e-07' \
+ --label-smoothing 0.1 --criterion label_smoothed_cross_entropy \
+ --dropout 0.3 --weight-decay 0.0001 \
+ --save-dir checkpoints/multilingual_transformer \
+ --max-tokens 4000 \
+ --update-freq 8
# Generate and score the test set with sacrebleu
-$ SRC=de
-$ sacrebleu --test-set iwslt17 --language-pair ${SRC}-en --echo src \
- | python scripts/spm_encode.py --model examples/translation/iwslt17.de_fr.en.bpe16k/sentencepiece.bpe.model \
- > iwslt17.test.${SRC}-en.${SRC}.bpe
-$ cat iwslt17.test.${SRC}-en.${SRC}.bpe \
- | fairseq-interactive data-bin/iwslt17.de_fr.en.bpe16k/ \
+SRC=de
+sacrebleu --test-set iwslt17 --language-pair ${SRC}-en --echo src \
+ | python scripts/spm_encode.py --model examples/translation/iwslt17.de_fr.en.bpe16k/sentencepiece.bpe.model \
+ > iwslt17.test.${SRC}-en.${SRC}.bpe
+cat iwslt17.test.${SRC}-en.${SRC}.bpe \
+ | fairseq-interactive data-bin/iwslt17.de_fr.en.bpe16k/ \
--task multilingual_translation --source-lang ${SRC} --target-lang en \
--path checkpoints/multilingual_transformer/checkpoint_best.pt \
--buffer 2000 --batch-size 128 \
--beam 5 --remove-bpe=sentencepiece \
- > iwslt17.test.${SRC}-en.en.sys
-$ grep ^H iwslt17.test.${SRC}-en.en.sys | cut -f3 \
- | sacrebleu --test-set iwslt17 --language-pair ${SRC}-en
+ > iwslt17.test.${SRC}-en.en.sys
+grep ^H iwslt17.test.${SRC}-en.en.sys | cut -f3 \
+ | sacrebleu --test-set iwslt17 --language-pair ${SRC}-en
```
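+
+The commands above are parameterized by the `SRC` variable, so the same trained model can be scored on the French-English direction as well. As a rough sketch reusing the commands above, both directions can be evaluated in a single loop:
+```bash
+# Score both source languages (de and fr) against English with the trained multilingual model
+for SRC in de fr; do
+  sacrebleu --test-set iwslt17 --language-pair ${SRC}-en --echo src \
+    | python scripts/spm_encode.py --model examples/translation/iwslt17.de_fr.en.bpe16k/sentencepiece.bpe.model \
+    | fairseq-interactive data-bin/iwslt17.de_fr.en.bpe16k/ \
+      --task multilingual_translation --source-lang ${SRC} --target-lang en \
+      --path checkpoints/multilingual_transformer/checkpoint_best.pt \
+      --buffer 2000 --batch-size 128 \
+      --beam 5 --remove-bpe=sentencepiece \
+    | grep ^H | cut -f3 \
+    | sacrebleu --test-set iwslt17 --language-pair ${SRC}-en
+done
+```
+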
### Argument format during inference
diff --git a/examples/translation_moe/README.md b/examples/translation_moe/README.md
index 4fc027e9c7..842be56bea 100644
--- a/examples/translation_moe/README.md
+++ b/examples/translation_moe/README.md
@@ -14,47 +14,47 @@ Use the `--method` flag to choose the MoE variant; we support hard mixtures with
The model is trained with online responsibility assignment and shared parameterization.
The following command will train a `hMoElp` model with `3` experts:
-```
-$ fairseq-train --ddp-backend='no_c10d' \
- data-bin/wmt17_en_de \
- --max-update 100000 \
- --task translation_moe \
- --method hMoElp --mean-pool-gating-network \
- --num-experts 3 \
- --arch transformer_wmt_en_de --share-all-embeddings \
- --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
- --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
- --lr 0.0007 --min-lr 1e-09 \
- --dropout 0.1 --weight-decay 0.0 --criterion cross_entropy \
- --max-tokens 3584
+```bash
+fairseq-train --ddp-backend='no_c10d' \
+ data-bin/wmt17_en_de \
+ --max-update 100000 \
+ --task translation_moe \
+ --method hMoElp --mean-pool-gating-network \
+ --num-experts 3 \
+ --arch transformer_wmt_en_de --share-all-embeddings \
+ --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
+ --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
+ --lr 0.0007 --min-lr 1e-09 \
+ --dropout 0.1 --weight-decay 0.0 --criterion cross_entropy \
+ --max-tokens 3584
```
## Translate
Once a model is trained, we can generate translations from different experts using the `--gen-expert` option.
For example, to generate from expert 0:
-```
-$ fairseq-generate data-bin/wmt17_en_de \
- --path checkpoints/checkpoint_best.pt \
- --beam 1 --remove-bpe \
- --task translation_moe \
- --method hMoElp --mean-pool-gating-network \
- --num-experts 3 \
- --gen-expert 0
+```bash
+fairseq-generate data-bin/wmt17_en_de \
+ --path checkpoints/checkpoint_best.pt \
+ --beam 1 --remove-bpe \
+ --task translation_moe \
+ --method hMoElp --mean-pool-gating-network \
+ --num-experts 3 \
+ --gen-expert 0
```
## Evaluate
First download a tokenized version of the WMT'14 En-De test set with multiple references:
-```
-$ wget dl.fbaipublicfiles.com/fairseq/data/wmt14-en-de.extra_refs.tok
+```bash
+wget dl.fbaipublicfiles.com/fairseq/data/wmt14-en-de.extra_refs.tok
```
Next apply BPE on the fly and run generation for each expert:
-```
-$ BPEROOT=examples/translation/subword-nmt/
-$ BPE_CODE=examples/translation/wmt17_en_de/code
-$ for EXPERT in $(seq 0 2); do \
+```bash
+BPEROOT=examples/translation/subword-nmt/
+BPE_CODE=examples/translation/wmt17_en_de/code
+for EXPERT in $(seq 0 2); do \
cat wmt14-en-de.extra_refs.tok \
| grep ^S | cut -f 2 \
| fairseq-interactive data-bin/wmt17_en_de \
@@ -66,15 +66,15 @@ $ for EXPERT in $(seq 0 2); do \
--method hMoElp --mean-pool-gating-network \
--num-experts 3 \
--gen-expert $EXPERT ; \
- done > wmt14-en-de.extra_refs.tok.gen.3experts
+done > wmt14-en-de.extra_refs.tok.gen.3experts
```
Finally use `score.py` to compute pairwise BLEU and average oracle BLEU:
-```
-$ python examples/translation_moe/score.py --sys wmt14-en-de.extra_refs.tok.gen.3experts --ref wmt14-en-de.extra_refs.tok
-pairwise BLEU: 48.26
-#refs covered: 2.11
-multi-reference BLEU (leave-one-out): 59.46
+```bash
+python examples/translation_moe/score.py --sys wmt14-en-de.extra_refs.tok.gen.3experts --ref wmt14-en-de.extra_refs.tok
+# pairwise BLEU: 48.26
+# #refs covered: 2.11
+# multi-reference BLEU (leave-one-out): 59.46
```
This matches row 3 from Table 7 in the paper.
diff --git a/examples/wmt19/README.md b/examples/wmt19/README.md
index fff13fa6ac..6eb7818925 100644
--- a/examples/wmt19/README.md
+++ b/examples/wmt19/README.md
@@ -4,86 +4,52 @@ This page provides pointers to the models of Facebook-FAIR's WMT'19 news transla
## Pre-trained models
-Description | Model
----|---
-En->De Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-de.joined-dict.ensemble.tar.gz)
-De->En Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.de-en.joined-dict.ensemble.tar.gz)
-En->Ru Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-ru.ensemble.tar.gz)
-Ru->En Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.ru-en.ensemble.tar.gz)
-En LM | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.en.tar.gz)
-De LM | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.de.tar.gz)
-Ru LM | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.ru.tar.gz)
+Model | Description | Download
+---|---|---
+`transformer.wmt19.en-de` | En->De Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-de.joined-dict.ensemble.tar.gz)
+`transformer.wmt19.de-en` | De->En Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.de-en.joined-dict.ensemble.tar.gz)
+`transformer.wmt19.en-ru` | En->Ru Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-ru.ensemble.tar.gz)
+`transformer.wmt19.ru-en` | Ru->En Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.ru-en.ensemble.tar.gz)
+`transformer_lm.wmt19.en` | En Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.en.tar.gz)
+`transformer_lm.wmt19.de` | De Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.de.tar.gz)
+`transformer_lm.wmt19.ru` | Ru Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.ru.tar.gz)
## Example usage (torch.hub)
-```
->>> import torch
->>> en2de = torch.hub.load(
-... 'pytorch/fairseq',
-... 'transformer.wmt19.en-de',
-... checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt'
-... tokenizer='moses',
-... bpe='fastbpe',
-... )
->>> en2de.generate("Machine learning is great!")
-'Maschinelles Lernen ist großartig!'
+```python
+import torch
+
+# English to German translation
+en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
+ tokenizer='moses', bpe='fastbpe')
+en2de.translate("Machine learning is great!") # 'Maschinelles Lernen ist großartig!'
->>> de2en = torch.hub.load(
-... 'pytorch/fairseq',
-... 'transformer.wmt19.de-en',
-... checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt'
-... tokenizer='moses',
-... bpe='fastbpe',
-... )
->>> de2en.generate("Maschinelles Lernen ist großartig!")
-'Machine learning is great!'
+# German to English translation
+de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
+ tokenizer='moses', bpe='fastbpe')
+de2en.translate("Maschinelles Lernen ist großartig!") # 'Machine learning is great!'
->>> en2ru = torch.hub.load(
-... 'pytorch/fairseq',
-... 'transformer.wmt19.en-ru',
-... checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt'
-... tokenizer='moses',
-... bpe='fastbpe',
-... )
->>> en2ru.generate("Machine learning is great!")
-'Машинное обучение - это здорово!'
+# English to Russian translation
+en2ru = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-ru', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
+ tokenizer='moses', bpe='fastbpe')
+en2ru.translate("Machine learning is great!") # 'Машинное обучение - это здорово!'
->>> ru2en = torch.hub.load(
-... 'pytorch/fairseq',
-... 'transformer.wmt19.ru-en',
-... checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt'
-... tokenizer='moses',
-... bpe='fastbpe',
-... )
->>> ru2en.generate("Машинное обучение - это здорово!")
-'Machine learning is great!'
+# Russian to English translation
+ru2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.ru-en', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
+ tokenizer='moses', bpe='fastbpe')
+ru2en.translate("Машинное обучение - это здорово!") # 'Machine learning is great!'
->>> en_lm = torch.hub.load(
-... 'pytorch.fairseq',
-... 'transformer_lm.wmt19.en'
-... tokenizer='moses',
-... bpe='fastbpe',
-... )
->>> en_lm.generate("Machine learning is")
-'Machine learning is the future of computing, says Microsoft boss Satya Nadella ...'
+# Sample from the English LM
+en_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.en', tokenizer='moses', bpe='fastbpe')
+en_lm.sample("Machine learning is") # 'Machine learning is the future of computing, says Microsoft boss Satya Nadella ...'
->>> de_lm = torch.hub.load(
-... 'pytorch.fairseq',
-... 'transformer_lm.wmt19.de'
-... tokenizer='moses',
-... bpe='fastbpe',
-... )
->>> de_lm.generate("Maschinelles lernen ist")
-''Maschinelles lernen ist das A und O (neues-deutschland.de) Die Arbeitsbedingungen für Lehrerinnen und Lehrer sind seit Jahren verbesserungswürdig ...'
+# Sample from the German LM
+de_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.de', tokenizer='moses', bpe='fastbpe')
+de_lm.sample("Maschinelles lernen ist") # 'Maschinelles lernen ist das A und O (neues-deutschland.de) Die Arbeitsbedingungen für Lehrerinnen und Lehrer sind seit Jahren verbesserungswürdig ...'
->>> ru_lm = torch.hub.load(
-... 'pytorch.fairseq',
-... 'transformer_lm.wmt19.ru'
-... tokenizer='moses',
-... bpe='fastbpe',
-... )
->>> ru_lm.generate("машинное обучение это")
-'машинное обучение это то, что мы называем "искусственным интеллектом".'
+# Sample from the Russian LM
+ru_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.ru', tokenizer='moses', bpe='fastbpe')
+ru_lm.sample("машинное обучение это") # 'машинное обучение это то, что мы называем "искусственным интеллектом".'
```
## Citation
diff --git a/fairseq/data/encoders/moses_tokenizer.py b/fairseq/data/encoders/moses_tokenizer.py
index deed30d880..b1e7478b9d 100644
--- a/fairseq/data/encoders/moses_tokenizer.py
+++ b/fairseq/data/encoders/moses_tokenizer.py
@@ -12,9 +12,9 @@ class MosesTokenizer(object):
@staticmethod
def add_args(parser):
# fmt: off
- parser.add_argument('--moses-source-lang', default='en', metavar='SRC',
+ parser.add_argument('--moses-source-lang', metavar='SRC',
help='source language')
- parser.add_argument('--moses-target-lang', default='en', metavar='TARGET',
+ parser.add_argument('--moses-target-lang', metavar='TARGET',
help='target language')
parser.add_argument('--moses-no-dash-splits', action='store_true', default=False,
help='don\'t apply dash split rules')
@@ -24,6 +24,12 @@ def add_args(parser):
def __init__(self, args):
self.args = args
+
+ if getattr(args, 'moses_source_lang', None) is None:
+ args.moses_source_lang = getattr(args, 'source_lang', 'en')
+ if getattr(args, 'moses_target_lang', None) is None:
+ args.moses_target_lang = getattr(args, 'target_lang', 'en')
+
try:
from sacremoses import MosesTokenizer, MosesDetokenizer
self.tok = MosesTokenizer(args.moses_source_lang)
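In effect, the Moses tokenizer now inherits the translation task's language pair when the `--moses-*` flags are left unset, instead of defaulting both to `en`. A minimal sketch of the new fallback behavior (the `Namespace` here is hypothetical; the logic mirrors the lines added to `__init__` above):

```python
from argparse import Namespace

# Hypothetical args: no explicit --moses-source-lang/--moses-target-lang,
# but the task provides --source-lang/--target-lang.
args = Namespace(moses_source_lang=None, moses_target_lang=None,
                 source_lang='de', target_lang='en')

# Same fallback as in MosesTokenizer.__init__: prefer the task's language pair,
# then fall back to 'en' only if nothing is set.
if getattr(args, 'moses_source_lang', None) is None:
    args.moses_source_lang = getattr(args, 'source_lang', 'en')
if getattr(args, 'moses_target_lang', None) is None:
    args.moses_target_lang = getattr(args, 'target_lang', 'en')

print(args.moses_source_lang, args.moses_target_lang)  # de en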
diff --git a/fairseq/hub_utils.py b/fairseq/hub_utils.py
index 06a2c55723..73fdd94dc9 100644
--- a/fairseq/hub_utils.py
+++ b/fairseq/hub_utils.py
@@ -97,12 +97,15 @@ def __init__(self, args, task, models):
def device(self):
return self._float_tensor.device
- def translate(self, sentence: str, verbose: bool = False, **kwargs) -> str:
+ def translate(self, sentence: str, beam: int = 5, verbose: bool = False, **kwargs) -> str:
+ return self.sample(sentence, beam, verbose, **kwargs)
+
+ def sample(self, sentence: str, beam: int = 1, verbose: bool = False, **kwargs) -> str:
input = self.encode(sentence)
- hypo = self.generate(input, verbose, **kwargs)
+ hypo = self.generate(input, beam, verbose, **kwargs)[0]['tokens']
return self.decode(hypo)
- def generate(self, tokens: torch.LongTensor, verbose: bool = False, **kwargs) -> torch.LongTensor:
+ def generate(self, tokens: torch.LongTensor, beam: int = 5, verbose: bool = False, **kwargs) -> torch.LongTensor:
sample = self._build_sample(tokens)
# build generator using current args as well as any kwargs
@@ -117,20 +120,24 @@ def generate(self, tokens: torch.LongTensor, verbose: bool = False, **kwargs) ->
src_str_with_unk = self.string(tokens)
print('S\t{}'.format(src_str_with_unk))
+ def getarg(name, default):
+ return getattr(gen_args, name, getattr(self.args, name, default))
+
# Process top predictions
- for hypo in translations[0][:min(len(translations), getattr(self.args, 'nbest', 1))]:
- hypo_str = self.decode(hypo['tokens'])
- if verbose:
+ hypos = translations[0]
+ if verbose:
+ for hypo in hypos:
+ hypo_str = self.decode(hypo['tokens'])
print('H\t{}\t{}'.format(hypo['score'], hypo_str))
print('P\t{}'.format(
' '.join(map(lambda x: '{:.4f}'.format(x), hypo['positional_scores'].tolist()))
))
- if hypo['alignment'] is not None and getattr(self.args, 'print_alignment', False):
+ if hypo['alignment'] is not None and getarg('print_alignment', False):
print('A\t{}'.format(
' '.join(map(lambda x: str(utils.item(x)), hypo['alignment'].int().cpu()))
))
- return hypo['tokens']
+ return hypos
def encode(self, sentence: str) -> torch.LongTensor:
sentence = self.tokenize(sentence)
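With this change, hub models expose `translate()` (beam search, `beam=5` by default, returning the best hypothesis as a string), `sample()` (greedy decoding by default, used by the LM examples above), and a `generate()` that returns the full list of hypothesis dicts. A rough usage sketch, assuming the `transformer.wmt19.en-de` hub entry from the README above (a single checkpoint is loaded here only to keep the example short):

```python
import torch

# Assumes the WMT'19 En-De hub model from the README above.
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de',
                       checkpoint_file='model1.pt', tokenizer='moses', bpe='fastbpe')

# translate(): beam search (beam=5 by default), returns the best hypothesis as a string
en2de.translate('Machine learning is great!')

# generate(): now returns every hypothesis as a dict with 'tokens', 'score', etc.
tokens = en2de.encode('Machine learning is great!')
hypos = en2de.generate(tokens, beam=5)
best = en2de.decode(hypos[0]['tokens'])
```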
diff --git a/hubconf.py b/hubconf.py
index 7e1574a684..ec27226da4 100644
--- a/hubconf.py
+++ b/hubconf.py
@@ -11,6 +11,7 @@
dependencies = [
+ 'fastBPE',
'regex',
'requests',
'sacremoses',
diff --git a/setup.py b/setup.py
index 1fd3f6dd34..83b3a7ee54 100644
--- a/setup.py
+++ b/setup.py
@@ -44,7 +44,9 @@
long_description_content_type='text/markdown',
install_requires=[
'cffi',
+ 'fastBPE',
'numpy',
+ 'regex',
'sacrebleu',
'torch',
'tqdm',
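For completeness, the new torch.hub dependencies can also be installed manually; this assumes the PyPI package names match the entries added to `hubconf.py` and `setup.py` above:

```bash
# Assumed PyPI names for the dependencies listed in hubconf.py / setup.py
pip install fastBPE regex requests sacremoses
```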