Update READMEs
Summary: Pull Request resolved: fairinternal/fairseq-py#823

Differential Revision: D16804995

Pulled By: myleott

fbshipit-source-id: abac5dc0ed6b7bfe2309ba273456e54b37340b2c
Myle Ott authored and facebook-github-bot committed Aug 14, 2019
1 parent ffffe04 commit b870468
Showing 8 changed files with 200 additions and 89 deletions.
39 changes: 18 additions & 21 deletions README.md
@@ -6,10 +6,10 @@ modeling and other text generation tasks.

### What's New:

- August 2019: [WMT'19 models released](examples/wmt19/README.md)
- July 2019: fairseq relicensed under MIT license
- July 2019: [RoBERTa models and code release](examples/roberta/README.md)
- June 2019: [wav2vec models and code release](examples/wav2vec/README.md)
- April 2019: [fairseq demo paper @ NAACL 2019](https://arxiv.org/abs/1904.01038)
- July 2019: [RoBERTa models and code released](examples/roberta/README.md)
- June 2019: [wav2vec models and code released](examples/wav2vec/README.md)

### Features:

@@ -31,6 +31,7 @@ Fairseq provides reference implementations of various sequence-to-sequence model
- [Adaptive Input Representations for Neural Language Modeling (Baevski and Auli, 2018)](examples/language_model/transformer_lm/README.md)
- [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](examples/translation_moe/README.md)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)](examples/roberta/README.md)
- [Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)](examples/wmt19/README.md)

**Additionally:**
- multi-GPU (distributed) training on one machine or across multiple machines
@@ -49,38 +50,33 @@ translation and language modeling datasets.

# Requirements and Installation

* [PyTorch](http://pytorch.org/) version >= 1.0.0
* [PyTorch](http://pytorch.org/) version >= 1.1.0
* Python version >= 3.5
* For training new models, you'll also need an NVIDIA GPU and [NCCL](https://github.com/NVIDIA/nccl)
* **For faster training** install NVIDIA's [apex](https://github.com/NVIDIA/apex) library with the `--cuda_ext` option

Please follow the instructions here to install PyTorch: https://github.com/pytorch/pytorch#installation.

If you use Docker make sure to increase the shared memory size either with
`--ipc=host` or `--shm-size` as command line options to `nvidia-docker run`.

After PyTorch is installed, you can install fairseq with `pip`:
```
To install fairseq:
```bash
pip install fairseq
```
On MacOS,
```

On MacOS:
```bash
CFLAGS="-stdlib=libc++" pip install fairseq
```

If you use Docker make sure to increase the shared memory size either with
`--ipc=host` or `--shm-size` as command line options to `nvidia-docker run`.
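
For example (the image name below is illustrative, not prescribed by the README):
```bash
nvidia-docker run --ipc=host -it --rm pytorch/pytorch:latest bash
```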

**Installing from source**

To install fairseq from source and develop locally:
```
```bash
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable .
```
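
A quick sanity check after installing (not part of the original instructions) is to confirm that PyTorch sees your GPU and that the fairseq CLI entry points are on your `PATH`:
```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
fairseq-train --help | head -n 5
```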

**Improved training speed**

Training speed can be further improved by installing NVIDIA's
[apex](https://github.com/NVIDIA/apex) library with the `--cuda_ext` option.
fairseq will automatically switch to the faster modules provided by apex.
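
As a sketch, apex is typically built from source so that its CUDA extensions get compiled; the commands below follow apex's own README at the time — check their instructions for the flags current to your setup:
```bash
# Build apex with its C++/CUDA extensions (requires a CUDA toolchain matching your PyTorch build)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```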

# Getting Started

The [full documentation](https://fairseq.readthedocs.io/) contains instructions
Expand All @@ -93,9 +89,10 @@ We provide pre-trained models and pre-processed, binarized test sets for several
as well as example training and evaluation commands.

- [Translation](examples/translation/README.md): convolutional and transformer models are available
- [Language Modeling](examples/language_model/README.md): convolutional models are available
- [Language Modeling](examples/language_model/README.md): convolutional and transformer models are available

We also have more detailed READMEs to reproduce results from specific papers:
- [Facebook FAIR's WMT19 News Translation Task Submission (Ng et al., 2019)](examples/wmt19/README.md)
- [RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)](examples/roberta/README.md)
- [wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)](examples/wav2vec/README.md)
- [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](examples/translation_moe/README.md)
73 changes: 36 additions & 37 deletions examples/language_model/README.md
@@ -27,58 +27,57 @@ en_lm.sample('Barack Obama', beam=1, sampling=True, sampling_topk=10, temperatur
# "Barack Obama is coming to Sydney and New Zealand (...)"
```

## Training a new model with the CLI tools
## Training a transformer language model with the CLI tools

These scripts provide an example of pre-processing data for the Language Modeling task.
### 1) Preprocess the data

### prepare-wikitext-103.sh

Provides an example of pre-processing for [WikiText-103 language modeling task](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/):

Example usage:

Prepare data:
First download and prepare the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/):
```bash
cd examples/language_model/
bash prepare-wikitext-103.sh
cd ../..
```

# Binarize the dataset:
Next preprocess/binarize the data:
```bash
TEXT=examples/language_model/wikitext-103

fairseq-preprocess --only-source \
--trainpref $TEXT/wiki.train.tokens --validpref $TEXT/wiki.valid.tokens --testpref $TEXT/wiki.test.tokens \
--destdir data-bin/wikitext-103
fairseq-preprocess \
--only-source \
--trainpref $TEXT/wiki.train.tokens \
--validpref $TEXT/wiki.valid.tokens \
--testpref $TEXT/wiki.test.tokens \
--destdir data-bin/wikitext-103 \
--workers 20
```
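
After this step, `data-bin/wikitext-103` holds the binarized data and vocabulary; the listing below is what `fairseq-preprocess` typically emits and is shown for illustration only:
```bash
ls data-bin/wikitext-103
# dict.txt  preprocess.log  test.bin  test.idx  train.bin  train.idx  valid.bin  valid.idx
```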

Train a transformer language model with adaptive inputs ([Baevski and Auli (2018): Adaptive Input Representations for Neural Language Modeling](transformer_lm/README.md)):
### 2) Train a language model

Next we'll train a transformer language model using [adaptive inputs](transformer_lm/README.md):
```bash
# If it runs out of memory, try to reduce max-tokens and tokens-per-sample
mkdir -p checkpoints/transformer_wikitext-103
fairseq-train --task language_modeling data-bin/wikitext-103 \
--save-dir checkpoints/transformer_wikitext-103 --arch transformer_lm_wiki103 \
fairseq-train --task language_modeling \
data-bin/wikitext-103 \
--save-dir checkpoints/transformer_wikitext-103 \
--arch transformer_lm_wiki103 \
--max-update 286000 --max-lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
--warmup-updates 16000 --warmup-init-lr 1e-07 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
--criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \
--sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d

# Evaluate:
fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/transformer_wiki103/checkpoint_best.pt' \
--sample-break-mode complete --max-tokens 3072 --context-window 2560 --softmax-batch 1024
```

Train a convolutional language model ([Dauphin et al. (2017): Language Modeling with Gated Convolutional Networks](conv_lm/README.md)):
```
# If it runs out of memory, try to reduce max-tokens and tokens-per-sample
mkdir -p checkpoints/fconv_wikitext-103
fairseq-train --task language_modeling data-bin/wikitext-103 \
--save-dir checkpoints/fconv_wikitext-103 \
--max-epoch 35 --arch fconv_lm_dauphin_wikitext103 --optimizer nag \
--lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
--clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion adaptive_loss \
--adaptive-softmax-cutoff 10000,20000,200000 --max-tokens 1024 --tokens-per-sample 1024 \
--ddp-backend=no_c10d
# Evaluate:
fairseq-eval-lm data-bin/wikitext-103 --path 'checkpoints/fconv_wiki103/checkpoint_best.pt'
If the above command runs out of memory, try reducing `--max-tokens` (max number
of tokens per batch) or `--tokens-per-sample` (max sequence length). You can
also increase `--update-freq` to accumulate gradients and simulate training on
more GPUs.
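
As an illustrative calculation (not from the README): the effective batch size per optimizer step is roughly `max-tokens × update-freq × #GPUs`, so halving `--max-tokens` while doubling `--update-freq` keeps it about the same.
```python
# Illustrative arithmetic only
max_tokens = 3072    # tokens per GPU per forward/backward pass
update_freq = 3      # gradient accumulation steps
num_gpus = 1
print(max_tokens * update_freq * num_gpus)  # 9216 tokens per optimizer step
```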

### 3) Evaluate
```bash
fairseq-eval-lm data-bin/wikitext-103 \
--path checkpoints/transformer_wikitext-103/checkpoint_best.pt \
--sample-break-mode complete --max-tokens 3072 \
--context-window 2560 --softmax-batch 1024
```
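
The command prints an average loss and the corresponding perplexity; as a worked example with an illustrative loss value (assuming a base-2 loss, as in fairseq's training logs):
```python
loss = 4.27                  # illustrative per-token loss in bits
print(round(2 ** loss, 2))   # ~19.29, the corresponding perplexity
```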

## Convolutional language models

Please see the [convolutional LM README](conv_lm/README.md) for instructions to
train convolutional language models.
23 changes: 21 additions & 2 deletions examples/language_model/conv_lm/README.md
@@ -2,8 +2,27 @@

## Example usage

See the [language modeling README](../README.md) for instructions on reproducing results for WikiText-103
using the `fconv_lm_dauphin_wikitext103` model architecture.
First download and preprocess the data following the main [language modeling
README](../README.md).

Then to train a convolutional LM using the `fconv_lm_dauphin_wikitext103`
architecture:
```bash
fairseq-train --task language_modeling \
data-bin/wikitext-103 \
--save-dir checkpoints/fconv_wikitext-103 \
--arch fconv_lm_dauphin_wikitext103 \
--max-epoch 35 --optimizer nag \
--lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
--clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion adaptive_loss \
--adaptive-softmax-cutoff 10000,20000,200000 --max-tokens 1024 --tokens-per-sample 1024 \
--ddp-backend=no_c10d
```

And evaluate with:
```bash
fairseq-eval-lm data-bin/wikitext-103 --path checkpoints/fconv_wikitext-103/checkpoint_best.pt
```

## Citation

2 changes: 1 addition & 1 deletion examples/language_model/transformer_lm/README.md
@@ -1,4 +1,4 @@
# Adaptive Input Representations for Neural Language Modeling (Baevski and Auli; 2018)
# Adaptive Input Representations for Neural Language Modeling (Baevski and Auli, 2018)

## Pre-trained models

4 changes: 2 additions & 2 deletions examples/roberta/README.cqa.md
@@ -8,7 +8,7 @@ representations through a fully-connected layer to predict the correct answer.
We train with a standard cross-entropy loss.

We also found it helpful to prepend a prefix of `Q:` to the question and `A:` to
the input. The complete input format is:
the answer. The complete input format is:
```
<s> Q: Where would I not want a fox? </s> A: hen house </s>
```
@@ -18,7 +18,7 @@ Our final submission is based on a hyperparameter search over the learning rate
4000) and random seed. We selected the model with the best performance on the
development set after 100 trials.

### 1) Download the data from Commonsense QA website (https://www.tau-nlp.org/commonsenseqa)
### 1) Download data from the Commonsense QA website (https://www.tau-nlp.org/commonsenseqa)
```bash
bash examples/roberta/commonsense_qa/download_cqa_data.sh
```
35 changes: 17 additions & 18 deletions examples/roberta/README.md
@@ -2,20 +2,24 @@

https://arxiv.org/abs/1907.11692

## Introduction
### Introduction

**RoBERTa** iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the associated paper for more details.
RoBERTa iterates on BERT's pretraining procedure, including training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. See the associated paper for more details.

## Pre-trained models
### What's New:

- August 2019: Added [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).

### Pre-trained models

Model | Description | # params | Download
---|---|---|---
`roberta.base` | RoBERTa using the BERT-base architecture | 125M | [roberta.base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.base.tar.gz)
`roberta.large` | RoBERTa using the BERT-large architecture | 355M | [roberta.large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz)
`roberta.large.mnli` | `roberta.large` finetuned on [MNLI](http://www.nyu.edu/projects/bowman/multinli) | 355M | [roberta.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.mnli.tar.gz)
`roberta.large.wsc` | `roberta.large` finetuned on [WSC](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) | 355M | [roberta.large.wsc.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.wsc.tar.gz)
`roberta.large.wsc` | `roberta.large` finetuned on [WSC](README.wsc.md) | 355M | [roberta.large.wsc.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.wsc.tar.gz)

## Results
### Results

##### Results on GLUE tasks (dev set, single model, single-task finetuning)

@@ -44,7 +48,7 @@ Model | Accuracy | Middle | High
---|---|---|---
`roberta.large` | 83.2 | 86.5 | 81.3

## Example usage
### Example usage

##### Load RoBERTa from torch.hub (PyTorch >= 1.1):
```python
@@ -53,15 +57,15 @@ roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval() # disable dropout (or leave in train mode to finetune)
```

##### Load RoBERTa (for PyTorch 1.0):
##### Load RoBERTa (for PyTorch 1.0 or custom models):
```python
# First download and extract the model in your shell:
#   wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
#   tar -xzvf roberta.large.tar.gz

# Load the model in fairseq
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('/path/to/roberta.large')
roberta = RobertaModel.from_pretrained('/path/to/roberta.large', checkpoint_file='model.pt')
roberta.eval() # disable dropout (or leave in train mode to finetune)
```
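
Either way, the loaded object exposes the same helpers; a minimal sketch (API names follow this release's hub interface; the feature shape assumes `roberta.large`):
```python
# Assumes `roberta` was loaded as in one of the snippets above
tokens = roberta.encode('Hello world!')        # BPE-encode into a tensor of token ids
features = roberta.extract_features(tokens)    # last-layer features, shape (1, num_tokens, 1024)
print(roberta.decode(tokens))                  # 'Hello world!'
```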

@@ -120,7 +124,7 @@ roberta.cuda()
roberta.predict('new_task', tokens) # tensor([[-1.1050, -1.0672, -1.1245]], device='cuda:0', grad_fn=<LogSoftmaxBackward>)
```

## Advanced usage
### Advanced usage

#### Filling masks:

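A minimal sketch of the mask-filling helper (`roberta` is the model loaded above; the prompt, `topk` value, and output format shown are illustrative):
```python
roberta.fill_mask('The first Star Wars film was released in <mask>.', topk=3)
# returns the top-k completions, each paired with a score and the predicted token,
# e.g. [('The first Star Wars film was released in 1977.', 0.98, ' 1977'), ...]
```
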
@@ -212,24 +216,19 @@ print('| Accuracy: ', float(ncorrect)/float(nsamples))
# Expected output: 0.9060
```


## Finetuning
### Finetuning

- [Finetuning on GLUE](README.glue.md)
- [Finetuning on custom classification tasks (e.g., IMDB)](README.custom_classification.md)
- [Finetuning on Winograd Schema Challenge (WSC)](README.wsc.md)
- [Finetuning on Commonsense QA (CQA)](README.cqa.md)
- Finetuning on SQuAD: coming soon

## Pretraining using your own data

You can use the [`masked_lm` task](/fairseq/tasks/masked_lm.py) to pretrain RoBERTa from scratch, or to continue pretraining RoBERTa starting from one of the released checkpoints.

Data should be preprocessed following the [language modeling example](/examples/language_model).
### Pretraining using your own data

A more detailed tutorial is coming soon.
See the [tutorial for pretraining RoBERTa using your own data](README.pretraining.md).

## Citation
### Citation

```bibtex
@article{liu2019roberta,
    title = {RoBERTa: A Robustly Optimized BERT Pretraining Approach},
    author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and
              Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and
              Luke Zettlemoyer and Veselin Stoyanov},
    journal = {arXiv preprint arXiv:1907.11692},
    year = {2019},
}
```