Update READMEs
Summary: Pull Request resolved: fairinternal/fairseq-py#823

Differential Revision: D16804995

Pulled By: myleott

fbshipit-source-id: abac5dc0ed6b7bfe2309ba273456e54b37340b2c
Myle Ott authored and facebook-github-bot committed Aug 14, 2019
1 parent ffffe04 commit b870468
Showing 8 changed files with 200 additions and 89 deletions.
39 changes: 18 additions & 21 deletions README.md
@@ -6,10 +6,10 @@ modeling and other text generation tasks.

### What's New:

- August 2019: [WMT'19 models released](examples/wmt19/README.md)
- July 2019: fairseq relicensed under MIT license
- July 2019: [RoBERTa models and code release](examples/roberta/README.md)
- June 2019: [wav2vec models and code release](examples/wav2vec/README.md)
- April 2019: [fairseq demo paper @ NAACL 2019](https://arxiv.org/abs/1904.01038)
- July 2019: [RoBERTa models and code released](examples/roberta/README.md)
- June 2019: [wav2vec models and code released](examples/wav2vec/README.md)

### Features:

@@ -31,6 +31,7 @@ Fairseq provides reference implementations of various sequence-to-sequence model
- [Adaptive Input Representations for Neural Language Modeling (Baevski and Auli, 2018)](examples/language_model/transformer_lm/README.md)
- [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](examples/translation_moe/README.md)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)](examples/roberta/README.md)
- [Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)](examples/wmt19/README.md)

**Additionally:**
- multi-GPU (distributed) training on one machine or across multiple machines
@@ -49,38 +50,33 @@ translation and language modeling datasets.

# Requirements and Installation

* [PyTorch](http://pytorch.org/) version >= 1.0.0
* [PyTorch](http://pytorch.org/) version >= 1.1.0
* Python version >= 3.5
* For training new models, you'll also need an NVIDIA GPU and [NCCL](https://github.com/NVIDIA/nccl)
* **For faster training** install NVIDIA's [apex](https://github.com/NVIDIA/apex) library with the `--cuda_ext` option

Please follow the instructions here to install PyTorch: https://github.com/pytorch/pytorch#installation.

If you use Docker make sure to increase the shared memory size either with
`--ipc=host` or `--shm-size` as command line options to `nvidia-docker run`.

After PyTorch is installed, you can install fairseq with `pip`:
```
To install fairseq:
```bash
pip install fairseq
```
On MacOS,
```

On MacOS:
```bash
CFLAGS="-stdlib=libc++" pip install fairseq
```

If you use Docker make sure to increase the shared memory size either with
`--ipc=host` or `--shm-size` as command line options to `nvidia-docker run`.
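
For example (the image name below is illustrative, not prescribed by the README):
```bash
nvidia-docker run --ipc=host -it --rm pytorch/pytorch:latest bash
```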

**Installing from source**

To install fairseq from source and develop locally:
```
```bash
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable .
```
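
A quick sanity check after installing (not part of the original instructions) is to confirm that PyTorch sees your GPU and that the fairseq CLI entry points are on your `PATH`:
```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
fairseq-train --help | head -n 5
```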

**Improved training speed**

Training speed can be further improved by installing NVIDIA's
[apex](https://github.com/NVIDIA/apex) library with the `--cuda_ext` option.
fairseq will automatically switch to the faster modules provided by apex.
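
As a sketch, apex is typically built from source so that its CUDA extensions get compiled; the commands below follow apex's own README at the time — check their instructions for the flags current to your setup:
```bash
# Build apex with its C++/CUDA extensions (requires a CUDA toolchain matching your PyTorch build)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```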

# Getting Started

The [full documentation](https://fairseq.readthedocs.io/) contains instructions
Expand All @@ -93,9 +89,10 @@ We provide pre-trained models and pre-processed, binarized test sets for several
as well as example training and evaluation commands.

- [Translation](examples/translation/README.md): convolutional and transformer models are available
- [Language Modeling](examples/language_model/README.md): convolutional models are available
- [Language Modeling](examples/language_model/README.md): convolutional and transformer models are available

We also have more detailed READMEs to reproduce results from specific papers:
- [Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)](examples/wmt19/README.md)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)](examples/roberta/README.md)
- [wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)](examples/wav2vec/README.md)
- [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](examples/translation_moe/README.md)
73 changes: 36 additions & 37 deletions examples/language_model/README.md
@@ -27,58 +27,57 @@ en_lm.sample('Barack Obama', beam=1, sampling=True, sampling_topk=10, temperatur
# "Barack Obama is coming to Sydney and New Zealand (...)"
```

## Training a new model with the CLI tools
## Training a transformer language model with the CLI tools

These scripts provide an example of pre-processing data for the Language Modeling task.
### 1) Preprocess the data

### prepare-wikitext-103.sh

Provides an example of pre-processing for [WikiText-103 language modeling task](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/):

Example usage:

Prepare data:
First download and prepare the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/):
```bash
cd examples/language_model/
bash prepare-wikitext-103.sh
cd ../..
```

# Binarize the dataset:
Next preprocess/binarize the data:
```bash
TEXT=examples/language_model/wikitext-103

fairseq-preprocess --only-source \
--trainpref $TEXT/wiki.train.tokens --validpref $TEXT/wiki.valid.tokens --testpref $TEXT/wiki.test.tokens \
--destdir data-bin/wikitext-103
fairseq-preprocess \
--only-source \
--trainpref $TEXT/wiki.train.tokens \
--validpref $TEXT/wiki.valid.tokens \
--testpref $TEXT/wiki.test.tokens \
--destdir data-bin/wikitext-103 \
--workers 20
```
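
After this step, `data-bin/wikitext-103` holds the binarized data and vocabulary; the listing below is what `fairseq-preprocess` typically emits and is shown for illustration only:
```bash
ls data-bin/wikitext-103
# dict.txt  preprocess.log  test.bin  test.idx  train.bin  train.idx  valid.bin  valid.idx
```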

Train a transformer language model with adaptive inputs ([Baevski and Auli (2018): Adaptive Input Representations for Neural Language Modeling](transformer_lm/README.md)):
### 2) Train a language model

Next we'll train a transformer language model using [adaptive inputs](transformer_lm/README.md):
```bash
# If it runs out of memory, try to reduce max-tokens and tokens-per-sample
mkdir -p checkpoints/transformer_wikitext-103
fairseq-train --task language_modeling data-bin/wikitext-103 \
--save-dir checkpoints/transformer_wikitext-103 --arch transformer_lm_wiki103 \
fairseq-train --task language_modeling \
data-bin/wikitext-103 \
--save-dir checkpoints/transformer_wikitext-103 \
--arch transformer_lm_wiki103 \
--max-update 286000 --max-lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
--warmup-updates 16000 --warmup-init-lr 1e-07 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
--criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \
--sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d

# Evaluate:
fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/transformer_wiki103/checkpoint_best.pt' \
--sample-break-mode complete --max-tokens 3072 --context-window 2560 --softmax-batch 1024
```

Train a convolutional language model ([Dauphin et al. (2017): Language Modeling with Gated Convolutional Networks](conv_lm/README.md)):
```
# If it runs out of memory, try to reduce max-tokens and tokens-per-sample
mkdir -p checkpoints/fconv_wikitext-103
fairseq-train --task language_modeling data-bin/wikitext-103 \
--save-dir checkpoints/fconv_wikitext-103 \
--max-epoch 35 --arch fconv_lm_dauphin_wikitext103 --optimizer nag \
--lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
--clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion adaptive_loss \
--adaptive-softmax-cutoff 10000,20000,200000 --max-tokens 1024 --tokens-per-sample 1024 \
--ddp-backend=no_c10d
# Evaluate:
fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/fconv_wiki103/checkpoint_best.pt'
If the above command runs out of memory, try reducing `--max-tokens` (max number
of tokens per batch) or `--tokens-per-sample` (max sequence length). You can
also increase `--update-freq` to accumulate gradients and simulate training on
more GPUs.
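
As an illustrative calculation (not from the README): the effective batch size per optimizer step is roughly `max-tokens × update-freq × #GPUs`, so halving `--max-tokens` while doubling `--update-freq` keeps it about the same.
```python
# Illustrative arithmetic only
max_tokens = 3072    # tokens per GPU per forward/backward pass
update_freq = 3      # gradient accumulation steps
num_gpus = 1
print(max_tokens * update_freq * num_gpus)  # 9216 tokens per optimizer step
```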

### 3) Evaluate
```bash
fairseq-eval-lm data-bin/wikitext-103 \
--path checkpoints/transformer_wikitext-103/checkpoint_best.pt \
--sample-break-mode complete --max-tokens 3072 \
--context-window 2560 --softmax-batch 1024
```
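
The command prints an average loss and the corresponding perplexity; as a worked example with an illustrative loss value (assuming a base-2 loss, as in fairseq's training logs):
```python
loss = 4.27                  # illustrative per-token loss in bits
print(round(2 ** loss, 2))   # ~19.29, the corresponding perplexity
```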

## Convolutional language models

Please see the [convolutional LM README](conv_lm/README.md) for instructions to
train convolutional language models.
23 changes: 21 additions & 2 deletions examples/language_model/conv_lm/README.md
@@ -2,8 +2,27 @@

## Example usage

See the [language modeling README](../README.md) for instructions on reproducing results for WikiText-103
using the `fconv_lm_dauphin_wikitext103` model architecture.
First download and preprocess the data following the main [language modeling
README](../README.md).

Then to train a convolutional LM using the `fconv_lm_dauphin_wikitext103`
architecture:
```bash
fairseq-train --task language_modeling \
data-bin/wikitext-103 \
--save-dir checkpoints/fconv_wikitext-103 \
--arch fconv_lm_dauphin_wikitext103 \
--max-epoch 35 --optimizer nag \
--lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
--clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion adaptive_loss \
--adaptive-softmax-cutoff 10000,20000,200000 --max-tokens 1024 --tokens-per-sample 1024 \
--ddp-backend=no_c10d
```

And evaluate with:
```bash
fairseq-eval-lm data-bin/wikitext-103 --path checkpoints/fconv_wikitext-103/checkpoint_best.pt
```

## Citation

2 changes: 1 addition & 1 deletion examples/language_model/transformer_lm/README.md
@@ -1,4 +1,4 @@
# Adaptive Input Representations for Neural Language Modeling (Baevski and Auli; 2018)
# Adaptive Input Representations for Neural Language Modeling (Baevski and Auli, 2018)

## Pre-trained models

4 changes: 2 additions & 2 deletions examples/roberta/README.cqa.md
@@ -8,7 +8,7 @@ representations through a fully-connected layer to predict the correct answer.
We train with a standard cross-entropy loss.

We also found it helpful to prepend a prefix of `Q:` to the question and `A:` to
the input. The complete input format is:
the answer. The complete input format is:
```
<s> Q: Where would I not want a fox? </s> A: hen house </s>
```
@@ -18,7 +18,7 @@ Our final submission is based on a hyperparameter search over the learning rate
4000) and random seed. We selected the model with the best performance on the
development set after 100 trials.

### 1) Download the data from Commonsense QA website (https://www.tau-nlp.org/commonsenseqa)
### 1) Download data from the Commonsense QA website (https://www.tau-nlp.org/commonsenseqa)
```bash
bash examples/roberta/commonsense_qa/download_cqa_data.sh
```
35 changes: 17 additions & 18 deletions examples/roberta/README.md
@@ -2,20 +2,24 @@

https://arxiv.org/abs/1907.11692

## Introduction
### Introduction

**RoBERTa** iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the associated paper for more details.
RoBERTa iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the associated paper for more details.

## Pre-trained models
### What's New:

- August 2019: Added [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).

### Pre-trained models

Model | Description | # params | Download
---|---|---|---
`roberta.base` | RoBERTa using the BERT-base architecture | 125M | [roberta.base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.base.tar.gz)
`roberta.large` | RoBERTa using the BERT-large architecture | 355M | [roberta.large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz)
`roberta.large.mnli` | `roberta.large` finetuned on [MNLI](http://www.nyu.edu/projects/bowman/multinli) | 355M | [roberta.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.mnli.tar.gz)
`roberta.large.wsc` | `roberta.large` finetuned on [WSC](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) | 355M | [roberta.large.wsc.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.wsc.tar.gz)
`roberta.large.wsc` | `roberta.large` finetuned on [WSC](README.wsc.md) | 355M | [roberta.large.wsc.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.wsc.tar.gz)

## Results
### Results

##### Results on GLUE tasks (dev set, single model, single-task finetuning)

@@ -44,7 +48,7 @@ Model | Accuracy | Middle | High
---|---|---|---
`roberta.large` | 83.2 | 86.5 | 81.3

## Example usage
### Example usage

##### Load RoBERTa from torch.hub (PyTorch >= 1.1):
```python
@@ -53,15 +57,15 @@ roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval() # disable dropout (or leave in train mode to finetune)
```

##### Load RoBERTa (for PyTorch 1.0):
##### Load RoBERTa (for PyTorch 1.0 or custom models):
```python
# First download and extract the model in your shell:
#   wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
#   tar -xzvf roberta.large.tar.gz

# Load the model in fairseq
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('/path/to/roberta.large')
roberta = RobertaModel.from_pretrained('/path/to/roberta.large', checkpoint_file='model.pt')
roberta.eval() # disable dropout (or leave in train mode to finetune)
```
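
Either way, the loaded object exposes the same helpers; a minimal sketch (API names follow this release's hub interface; the feature shape assumes `roberta.large`):
```python
# Assumes `roberta` was loaded as in one of the snippets above
tokens = roberta.encode('Hello world!')        # BPE-encode into a tensor of token ids
features = roberta.extract_features(tokens)    # last-layer features, shape (1, num_tokens, 1024)
print(roberta.decode(tokens))                  # 'Hello world!'
```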

@@ -120,7 +124,7 @@ roberta.cuda()
roberta.predict('new_task', tokens) # tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
```

## Advanced usage
### Advanced usage

#### Filling masks:

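A minimal sketch of the mask-filling helper (`roberta` is the model loaded above; the prompt, `topk` value, and output format shown are illustrative):
```python
roberta.fill_mask('The first Star Wars film was released in <mask>.', topk=3)
# returns the top-k completions, each paired with a score and the predicted token,
# e.g. [('The first Star Wars film was released in 1977.', 0.98, ' 1977'), ...]
```
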
@@ -212,24 +216,19 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples))
# Expected output: 0.9060
```


## Finetuning
### Finetuning

- [Finetuning on GLUE](README.glue.md)
- [Finetuning on custom classification tasks (e.g., IMDB)](README.custom_classification.md)
- [Finetuning on Winograd Schema Challenge (WSC)](README.wsc.md)
- [Finetuning on Commonsense QA (CQA)](README.cqa.md)
- Finetuning on SQuAD: coming soon

## Pretraining using your own data

You can use the [`masked_lm` task](/fairseq/tasks/masked_lm.py) to pretrain RoBERTa from scratch, or to continue pretraining RoBERTa starting from one of the released checkpoints.

Data should be preprocessed following the [language modeling example](/examples/language_model).
### Pretraining using your own data

A more detailed tutorial is coming soon.
See the [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).

## Citation
### Citation

```bibtex
@article{liu2019roberta,
    title = {RoBERTa: A Robustly Optimized BERT Pretraining Approach},
    author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and
              Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and
              Luke Zettlemoyer and Veselin Stoyanov},
    journal = {arXiv preprint arXiv:1907.11692},
    year = {2019},
}
```