From ac66df47b5394e730aa05efa50ed7ec6103388bb Mon Sep 17 00:00:00 2001 From: Myle Ott Date: Thu, 15 Aug 2019 09:45:46 -0700 Subject: [PATCH] Update README Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/826 Differential Revision: D16830402 Pulled By: myleott fbshipit-source-id: 25afaa6d9de7b51cc884e3f417c8e6b349f5a7bc --- examples/roberta/README.md | 50 ++++++--- examples/roberta/README.pretraining.md | 2 +- examples/scaling_nmt/README.md | 36 ++++--- examples/translation/README.md | 141 ++++++++++++------------- 4 files changed, 129 insertions(+), 100 deletions(-) diff --git a/examples/roberta/README.md b/examples/roberta/README.md index 022ea0e3c1..15119a345a 100644 --- a/examples/roberta/README.md +++ b/examples/roberta/README.md @@ -2,7 +2,7 @@ https://arxiv.org/abs/1907.11692 -### Introduction +## Introduction RoBERTa iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the associated paper for more details. @@ -10,7 +10,7 @@ RoBERTa iterates on BERT's pretraining procedure, including training the model l - August 2019: Added [tutorial for pretraining RoBERTa using your own data](README.pretraining.md). -### Pre-trained models +## Pre-trained models Model | Description | # params | Download ---|---|---|--- @@ -19,9 +19,10 @@ Model | Description | # params | Download `roberta.large.mnli` | `roberta.large` finetuned on [MNLI](http://www.nyu.edu/projects/bowman/multinli) | 355M | [roberta.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.mnli.tar.gz) `roberta.large.wsc` | `roberta.large` finetuned on [WSC](wsc/README.md) | 355M | [roberta.large.wsc.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.wsc.tar.gz) -### Results +## Results -##### Results on GLUE tasks (dev set, single model, single-task finetuning) +**[GLUE (Wang et al., 2019)](https://gluebenchmark.com/)** +_(dev set, single model, single-task finetuning)_ Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B ---|---|---|---|---|---|---|---|--- @@ -29,26 +30,51 @@ Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B `roberta.large` | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4 `roberta.large.mnli` | 90.2 | - | - | - | - | - | - | - -##### Results on SuperGLUE tasks (dev set, single model, single-task finetuning) +**[SuperGLUE (Wang et al., 2019)](https://super.gluebenchmark.com/)** +_(dev set, single model, single-task finetuning)_ Model | BoolQ | CB | COPA | MultiRC | RTE | WiC | WSC ---|---|---|---|---|---|---|--- `roberta.large` | 86.9 | 98.2 | 94.0 | 85.7 | 89.5 | 75.6 | - `roberta.large.wsc` | - | - | - | - | - | - | 91.3 -##### Results on SQuAD (dev set) +**[SQuAD (Rajpurkar et al., 2018)](https://rajpurkar.github.io/SQuAD-explorer/)** +_(dev set, no additional data used)_ Model | SQuAD 1.1 EM/F1 | SQuAD 2.0 EM/F1 ---|---|--- `roberta.large` | 88.9/94.6 | 86.5/89.4 -##### Results on Reading Comprehension (RACE, test set) +**[RACE (Lai et al., 2017)](http://www.qizhexie.com/data/RACE_leaderboard.html)** +_(test set)_ Model | Accuracy | Middle | High ---|---|---|--- `roberta.large` | 83.2 | 86.5 | 81.3 -### Example usage +**[HellaSwag (Zellers et al., 2019)](https://rowanzellers.com/hellaswag/)** +_(test set)_ + +Model | Overall | In-domain | Zero-shot | ActivityNet | WikiHow +---|---|---|---|---|--- 
+`roberta.large` | 85.2 | 87.3 | 83.1 | 74.6 | 90.9 + +**[Commonsense QA (Talmor et al., 2019)](https://www.tau-nlp.org/commonsenseqa)** +_(test set)_ + +Model | Accuracy +---|--- +`roberta.large` (single model) | 72.1 +`roberta.large` (ensemble) | 72.5 + +**[Winogrande (Sakaguchi et al., 2019)](https://arxiv.org/abs/1907.10641)** +_(test set)_ + +Model | Accuracy +---|--- +`roberta.large` | 78.1 + +## Example usage ##### Load RoBERTa from torch.hub (PyTorch >= 1.1): ```python @@ -124,7 +150,7 @@ roberta.cuda() roberta.predict('new_task', tokens) # tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=) ``` -### Advanced usage +## Advanced usage #### Filling masks: @@ -216,7 +242,7 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples)) # Expected output: 0.9060 ``` -### Finetuning +## Finetuning - [Finetuning on GLUE](README.glue.md) - [Finetuning on custom classification tasks (e.g., IMDB)](README.custom_classification.md) @@ -224,11 +250,11 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples)) - [Finetuning on Commonsense QA (CQA)](commonsense_qa/README.md) - Finetuning on SQuAD: coming soon -### Pretraining using your own data +## Pretraining using your own data See the [tutorial for pretraining RoBERTa using your own data](README.pretraining.md). -### Citation +## Citation ```bibtex @article{liu2019roberta, diff --git a/examples/roberta/README.pretraining.md b/examples/roberta/README.pretraining.md index 843d7ce377..0e82bc93fb 100644 --- a/examples/roberta/README.pretraining.md +++ b/examples/roberta/README.pretraining.md @@ -2,7 +2,7 @@ This tutorial will walk you through pretraining RoBERTa over your own data. -### 1) Preprocess the data. +### 1) Preprocess the data Data should be preprocessed following the [language modeling format](/examples/language_model). diff --git a/examples/scaling_nmt/README.md b/examples/scaling_nmt/README.md index 1e47917baf..a1d40ea623 100644 --- a/examples/scaling_nmt/README.md +++ b/examples/scaling_nmt/README.md @@ -11,45 +11,57 @@ Model | Description | Dataset | Download ## Training a new model on WMT'16 En-De -Please first download the [preprocessed WMT'16 En-De data provided by Google](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8). +First download the [preprocessed WMT'16 En-De data provided by Google](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8). + Then: -1. Extract the WMT'16 En-De data: +##### 1. Extract the WMT'16 En-De data ```bash TEXT=wmt16_en_de_bpe32k mkdir -p $TEXT tar -xzvf wmt16_en_de.tar.gz -C $TEXT ``` -2. Preprocess the dataset with a joined dictionary: +##### 2. Preprocess the dataset with a joined dictionary ```bash -fairseq-preprocess --source-lang en --target-lang de \ +fairseq-preprocess \ + --source-lang en --target-lang de \ --trainpref $TEXT/train.tok.clean.bpe.32000 \ --validpref $TEXT/newstest2013.tok.bpe.32000 \ --testpref $TEXT/newstest2014.tok.bpe.32000 \ --destdir data-bin/wmt16_en_de_bpe32k \ --nwordssrc 32768 --nwordstgt 32768 \ - --joined-dictionary + --joined-dictionary \ + --workers 20 ``` -3. Train a model: +##### 3. 
Train a model ```bash -fairseq-train data-bin/wmt16_en_de_bpe32k \ +fairseq-train \ + data-bin/wmt16_en_de_bpe32k \ --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \ --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \ - --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \ - --lr 0.0005 --min-lr 1e-09 \ - --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \ + --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \ + --dropout 0.3 --weight-decay 0.0 \ + --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \ --max-tokens 3584 \ --fp16 ``` -Note that the `--fp16` flag requires you have CUDA 9.1 or greater and a Volta GPU. +Note that the `--fp16` flag requires you have CUDA 9.1 or greater and a Volta GPU or newer. If you want to train the above model with big batches (assuming your machine has 8 GPUs): -- add `--update-freq 16` to simulate training on 8*16=128 GPUs +- add `--update-freq 16` to simulate training on 8x16=128 GPUs - increase the learning rate; 0.001 works well for big batches +##### 4. Evaluate +```bash +fairseq-generate \ + data-bin/wmt16_en_de_bpe32k \ + --path checkpoints/checkpoint_best.pt \ + --beam 4 --lenpen 0.6 --remove-bpe +``` + ## Citation ```bibtex diff --git a/examples/translation/README.md b/examples/translation/README.md index a43f0af1ad..b93115147a 100644 --- a/examples/translation/README.md +++ b/examples/translation/README.md @@ -1,5 +1,8 @@ # Neural Machine Translation +This README contains instructions for [using pretrained translation models](#example-usage-torchhub) +as well as [training new models](#training-a-new-model). + ## Pre-trained models Model | Description | Dataset | Download @@ -56,132 +59,119 @@ fairseq-score --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref # BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787) ``` -## Preprocessing - -These scripts provide an example of pre-processing data for the NMT task. +## Training a new model -### prepare-iwslt14.sh +### IWSLT'14 German to English (Transformer) -Provides an example of pre-processing for IWSLT'14 German to English translation task: ["Report on the 11th IWSLT evaluation campaign" by Cettolo et al.](http://workshop2014.iwslt.org/downloads/proceeding.pdf) +The following instructions can be used to train a Transformer model on the [IWSLT'14 German to English dataset](http://workshop2014.iwslt.org/downloads/proceeding.pdf). -Example usage: +First download and preprocess the data: ```bash +# Download and prepare the data cd examples/translation/ bash prepare-iwslt14.sh cd ../.. 
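+# Note: prepare-iwslt14.sh should leave the tokenized, BPE-encoded data in
+# examples/translation/iwslt14.tokenized.de-en (the $TEXT directory used below).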
-# Binarize the dataset: +# Preprocess/binarize the data TEXT=examples/translation/iwslt14.tokenized.de-en fairseq-preprocess --source-lang de --target-lang en \ --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \ - --destdir data-bin/iwslt14.tokenized.de-en + --destdir data-bin/iwslt14.tokenized.de-en \ + --workers 20 +``` -# Train the model (better for a single GPU setup): -mkdir -p checkpoints/fconv -CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \ - --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \ +Next we'll train a Transformer translation model over this data: +```bash +CUDA_VISIBLE_DEVICES=0 fairseq-train \ + data-bin/iwslt14.tokenized.de-en \ + --arch transformer_iwslt_de_en --share-decoder-input-output-embed \ + --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \ + --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \ + --dropout 0.3 --weight-decay 0.0001 \ --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \ - --lr-scheduler fixed --force-anneal 200 \ - --arch fconv_iwslt_de_en --save-dir checkpoints/fconv - -# Generate: -fairseq-generate data-bin/iwslt14.tokenized.de-en \ - --path checkpoints/fconv/checkpoint_best.pt \ - --batch-size 128 --beam 5 --remove-bpe - + --max-tokens 4096 ``` -To train transformer model on IWSLT'14 German to English: +Finally we can evaluate our trained model: ```bash -# Preparation steps are the same as for fconv model. - -# Train the model (better for a single GPU setup): -mkdir -p checkpoints/transformer -CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \ - -a transformer_iwslt_de_en --optimizer adam --lr 0.0005 -s de -t en \ - --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 \ - --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \ - --criterion label_smoothed_cross_entropy --max-update 50000 \ - --warmup-updates 4000 --warmup-init-lr '1e-07' \ - --adam-betas '(0.9, 0.98)' --save-dir checkpoints/transformer - -# Average 10 latest checkpoints: -python scripts/average_checkpoints.py --inputs checkpoints/transformer \ - --num-epoch-checkpoints 10 --output checkpoints/transformer/model.pt - -# Generate: fairseq-generate data-bin/iwslt14.tokenized.de-en \ - --path checkpoints/transformer/model.pt \ + --path checkpoints/checkpoint_best.pt \ --batch-size 128 --beam 5 --remove-bpe ``` -### prepare-wmt14en2de.sh - -The WMT English to German dataset can be preprocessed using the `prepare-wmt14en2de.sh` script. -By default it will produce a dataset that was modeled after ["Attention Is All You Need" (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762), but with news-commentary-v12 data from WMT'17. +### WMT'14 English to German (Convolutional) -To use only data available in WMT'14 or to replicate results obtained in the original ["Convolutional Sequence to Sequence Learning" (Gehring et al., 2017)](https://arxiv.org/abs/1705.03122) paper, please use the `--icml17` option. +The following instructions can be used to train a Convolutional translation model on the WMT English to German dataset. +See the [Scaling NMT README](../scaling_nmt/README.md) for instructions to train a Transformer translation model on this data. -```bash -bash prepare-wmt14en2de.sh --icml17 -``` +The WMT English to German dataset can be preprocessed using the `prepare-wmt14en2de.sh` script. 
+By default it will produce a dataset that was modeled after [Attention Is All You Need (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762), but with additional news-commentary-v12 data from WMT'17. -Example usage: +To use only data available in WMT'14 or to replicate results obtained in the original [Convolutional Sequence to Sequence Learning (Gehring et al., 2017)](https://arxiv.org/abs/1705.03122) paper, please use the `--icml17` option. ```bash +# Download and prepare the data cd examples/translation/ +# WMT'17 data: bash prepare-wmt14en2de.sh +# or to use WMT'14 data: +# bash prepare-wmt14en2de.sh --icml17 cd ../.. -# Binarize the dataset: +# Binarize the dataset TEXT=examples/translation/wmt17_en_de -fairseq-preprocess --source-lang en --target-lang de \ +fairseq-preprocess \ + --source-lang en --target-lang de \ --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \ - --destdir data-bin/wmt17_en_de --thresholdtgt 0 --thresholdsrc 0 + --destdir data-bin/wmt17_en_de --thresholdtgt 0 --thresholdsrc 0 \ + --workers 20 -# Train the model: -# If it runs out of memory, try to set --max-tokens 1500 instead +# Train the model mkdir -p checkpoints/fconv_wmt_en_de -fairseq-train data-bin/wmt17_en_de \ +fairseq-train \ + data-bin/wmt17_en_de \ + --arch fconv_wmt_en_de \ --lr 0.5 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \ --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \ --lr-scheduler fixed --force-anneal 50 \ - --arch fconv_wmt_en_de --save-dir checkpoints/fconv_wmt_en_de + --save-dir checkpoints/fconv_wmt_en_de -# Generate: +# Evaluate fairseq-generate data-bin/wmt17_en_de \ - --path checkpoints/fconv_wmt_en_de/checkpoint_best.pt --beam 5 --remove-bpe + --path checkpoints/fconv_wmt_en_de/checkpoint_best.pt \ + --beam 5 --remove-bpe ``` -### prepare-wmt14en2fr.sh - -Provides an example of pre-processing for the WMT'14 English to French translation task. - -Example usage: - +### WMT'14 English to French ```bash +# Download and prepare the data cd examples/translation/ bash prepare-wmt14en2fr.sh cd ../.. 
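+# Note: prepare-wmt14en2fr.sh should leave the prepared data in
+# examples/translation/wmt14_en_fr (the $TEXT directory used below).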
-# Binarize the dataset:
+# Binarize the dataset
 TEXT=examples/translation/wmt14_en_fr
-fairseq-preprocess --source-lang en --target-lang fr \
+fairseq-preprocess \
+    --source-lang en --target-lang fr \
     --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
-    --destdir data-bin/wmt14_en_fr --thresholdtgt 0 --thresholdsrc 0
+    --destdir data-bin/wmt14_en_fr --thresholdtgt 0 --thresholdsrc 0 \
+    --workers 60
 
-# Train the model:
-# If it runs out of memory, try to set --max-tokens 1000 instead
+# Train the model
 mkdir -p checkpoints/fconv_wmt_en_fr
-fairseq-train data-bin/wmt14_en_fr \
+fairseq-train \
+    data-bin/wmt14_en_fr \
     --lr 0.5 --clip-norm 0.1 --dropout 0.1 --max-tokens 3000 \
     --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
     --lr-scheduler fixed --force-anneal 50 \
-    --arch fconv_wmt_en_fr --save-dir checkpoints/fconv_wmt_en_fr
-
-# Generate:
-fairseq-generate data-bin/fconv_wmt_en_fr \
-    --path checkpoints/fconv_wmt_en_fr/checkpoint_best.pt --beam 5 --remove-bpe
+    --arch fconv_wmt_en_fr \
+    --save-dir checkpoints/fconv_wmt_en_fr
+
+# Evaluate
+fairseq-generate \
+    data-bin/wmt14_en_fr \
+    --path checkpoints/fconv_wmt_en_fr/checkpoint_best.pt \
+    --beam 5 --remove-bpe
 ```
 
 ## Multilingual Translation
@@ -253,7 +243,8 @@ grep ^H iwslt17.test.${SRC}-en.en.sys | cut -f3 \
     | sacrebleu --test-set iwslt17 --language-pair ${SRC}-en
 ```
 
-### Argument format during inference
+##### Argument format during inference
+
 During inference it is required to specify a single `--source-lang` and `--target-lang`, which indicates the inference language direction. `--lang-pairs`, `--encoder-langtok`, `--decoder-langtok` have to be set to
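A minimal sketch of such an inference call, assuming a model trained with `--task multilingual_translation` and `--lang-pairs de-en,fr-en`; the `data-bin` directory and checkpoint path below are placeholders, and any `--encoder-langtok`/`--decoder-langtok` settings should match the ones used during training:

```bash
# Placeholder paths and language pairs; keep --lang-pairs (and any langtok
# flags) identical to the values used when the multilingual model was trained.
fairseq-generate data-bin/iwslt17.de_fr.en.bpe16k/ \
    --task multilingual_translation --lang-pairs de-en,fr-en \
    --source-lang de --target-lang en \
    --path checkpoints/multilingual_transformer/checkpoint_best.pt \
    --batch-size 32 --beam 5
```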