Finetuning RoBERTa on Winograd Schema Challenge (WSC) data

The following instructions can be used to finetune RoBERTa on the WSC training data provided by SuperGLUE.

Note that results vary considerably from run to run. For our GLUE/SuperGLUE submission we swept over the learning rate, batch size, total number of updates and random seed. Out of ~100 runs we chose the 7 best models and ensembled them.
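The sweep-then-ensemble procedure can be sketched roughly as follows. This is only an illustration: the grid values are hypothetical (not the ones used for the submission), the helper names are made up for this example, and majority voting is an assumed way of ensembling, with each run assumed to yield one True/False prediction per example.

```python
import itertools

# Hypothetical sweep values, for illustration only.
LRS = [1e-05, 2e-05, 3e-05]
BATCH_SIZES = [16, 32]
NUM_UPDATES = [2000, 3000]
SEEDS = [1, 2, 3]


def sweep_configs():
    """Yield one config dict per (lr, batch size, updates, seed) combination."""
    for lr, bsz, updates, seed in itertools.product(
        LRS, BATCH_SIZES, NUM_UPDATES, SEEDS
    ):
        yield {
            "lr": lr,
            "max_sentences": bsz,
            "total_num_updates": updates,
            "seed": seed,
        }


def top_k_runs(dev_accuracy_by_run, k=7):
    """Return the names of the k runs with the highest dev-set accuracy."""
    return sorted(dev_accuracy_by_run, key=dev_accuracy_by_run.get, reverse=True)[:k]


def majority_vote(predictions_by_run):
    """Combine per-run boolean predictions example by example.

    Assumed ensembling rule: an example is True only if strictly more
    than half of the runs predict True.
    """
    per_example_votes = zip(*predictions_by_run.values())
    return [sum(votes) * 2 > len(votes) for votes in per_example_votes]
```

Each config would correspond to one fairseq-train run; the runs kept by top_k_runs are the ones whose predictions get combined.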

Note: The instructions below use a slightly different loss function than the one described in the original RoBERTa arXiv paper. In particular, Kocijan et al. (2019) introduce a margin ranking loss between (query, candidate) pairs with tunable hyperparameters alpha and beta. Our code supports this as well, through the --wsc-alpha and --wsc-beta arguments. However, we achieved slightly better (and more robust) results on the development set by instead using a single cross-entropy loss term over the log-probabilities for the query and all candidates. This reduces the number of hyperparameters, and our best model achieved 92.3% development set accuracy, compared to ~90% accuracy for the margin loss. Later versions of the RoBERTa arXiv paper will describe this updated formulation.
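To make the difference between the two losses concrete, here is a minimal sketch in plain Python. It assumes each span (the query and every candidate) has already been scored with a single log-probability. The function names and the default alpha and beta values are illustrative, and the way the margin loss is summed over candidates is an assumption rather than a copy of the fairseq criterion.

```python
import math


def cross_entropy_loss(query_lprob, cand_lprobs):
    """Softmax cross entropy over all scores, with the query as the correct class."""
    scores = [query_lprob] + list(cand_lprobs)
    top = max(scores)  # subtract the max for numerical stability
    log_normalizer = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_normalizer - query_lprob


def margin_ranking_loss(query_lprob, cand_lprobs, alpha=1.0, beta=0.0):
    """Negative log-likelihood plus a hinge term per (query, candidate) pair.

    Assumed form: the per-pair losses are summed over candidates.
    """
    return sum(
        -query_lprob + alpha * max(0.0, cand - query_lprob + beta)
        for cand in cand_lprobs
    )
```

The cross-entropy version has no alpha or beta to tune, which is the reduction in hyperparameters mentioned above.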

1) Download the WSC data from the SuperGLUE website:

wget https://dl.fbaipublicfiles.com/glue/superglue/data/v2/WSC.zip
unzip WSC.zip

# we also need to copy the RoBERTa dictionary into the same directory
wget -O WSC/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
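Before training, it can help to sanity-check the downloaded files. The sketch below counts the examples and label values in a JSON-lines file. It assumes each non-empty line is a JSON object with a "label" field; summarize_jsonl is a hypothetical helper, not part of fairseq.

```python
import json
from collections import Counter


def summarize_jsonl(path):
    """Return (number of examples, Counter of label values) for a .jsonl file."""
    label_counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Records without a "label" field are counted under None.
            label_counts[record.get("label")] += 1
    return sum(label_counts.values()), label_counts
```

For example, summarize_jsonl('WSC/val.jsonl') returns the size of the development set and how many examples carry each label.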

2) Finetune over the provided training data:

TOTAL_NUM_UPDATES=2000  # Total number of training steps.
WARMUP_UPDATES=250      # Linearly increase LR over this many steps.
LR=2e-05                # Peak LR for polynomial LR scheduler.
MAX_SENTENCES=16        # Batch size per GPU.
SEED=1                  # Random seed.
ROBERTA_PATH=/path/to/roberta/model.pt

# we use the --user-dir option to load the task and criterion
# from the examples/roberta/wsc directory:
FAIRSEQ_PATH=/path/to/fairseq
FAIRSEQ_USER_DIR=${FAIRSEQ_PATH}/examples/roberta/wsc

cd fairseq
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train WSC/ \
  --restore-file $ROBERTA_PATH \
  --reset-optimizer --reset-dataloader --reset-meters \
  --no-epoch-checkpoints --no-last-checkpoints --no-save-optimizer-state \
  --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
  --valid-subset val \
  --fp16 --ddp-backend no_c10d \
  --user-dir $FAIRSEQ_USER_DIR \
  --task wsc --criterion wsc --wsc-cross-entropy \
  --arch roberta_large --bpe gpt2 --max-positions 512 \
  --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 \
  --lr-scheduler polynomial_decay --lr $LR \
  --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_NUM_UPDATES \
  --max-sentences $MAX_SENTENCES \
  --max-update $TOTAL_NUM_UPDATES \
  --log-format simple --log-interval 100

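The above command assumes training on 4 GPUs, but you can achieve the same results on a single GPU by restricting CUDA_VISIBLE_DEVICES to one device and adding --update-freq=4, which accumulates gradients over 4 batches so that the effective batch size is unchanged.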
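The arithmetic behind these hyperparameters can be checked with a short sketch. The two helpers below are hypothetical and only mirror the settings above: the first computes the effective batch size per update, the second approximates a polynomial-decay schedule with linear warmup. The end learning rate of 0 and the power of 1 (linear decay) are assumptions, not values taken from the command.

```python
def effective_batch_size(max_sentences, num_gpus, update_freq=1):
    """Sentences contributing to each parameter update."""
    return max_sentences * num_gpus * update_freq


def polynomial_decay_lr(step, peak_lr=2e-05, warmup_updates=250,
                        total_num_updates=2000, end_lr=0.0, power=1.0):
    """Learning rate at a given update (assumed: end_lr=0.0, power=1.0)."""
    if warmup_updates > 0 and step <= warmup_updates:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_updates
    if step >= total_num_updates:
        return end_lr
    # Fraction of the decay phase that is still ahead.
    pct_remaining = 1 - (step - warmup_updates) / (total_num_updates - warmup_updates)
    return (peak_lr - end_lr) * pct_remaining ** power + end_lr


# 4 GPUs without accumulation and 1 GPU with --update-freq=4 match.
assert effective_batch_size(16, num_gpus=4) == effective_batch_size(16, num_gpus=1, update_freq=4)
```

Under these assumptions the learning rate reaches its peak at update 250 and decays linearly to 0 at update 2000.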

3) Evaluate:

# run from the fairseq root directory so that `examples` is importable
from fairseq.models.roberta import RobertaModel
from examples.roberta.wsc import wsc_utils  # also loads WSC task and criterion

# load the best checkpoint saved by fairseq-train, with the data directory WSC/
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'WSC/')
roberta.cuda()
roberta.eval()  # disable dropout

# count how many development-set predictions match the gold labels
nsamples, ncorrect = 0, 0
for sentence, label in wsc_utils.jsonl_iterator('WSC/val.jsonl', eval=True):
    pred = roberta.disambiguate_pronoun(sentence)
    nsamples += 1
    if pred == label:
        ncorrect += 1
print('Accuracy: ' + str(ncorrect / float(nsamples)))
# Accuracy: 0.9230769230769231
