`transformers` has ported the `pegasus` project. For usage please see: Pegasus.
This sub-repo contains links to the evaluation datasets, as well as the build scripts used to create that data.

Datasets that we managed to build successfully, with links to s3:
| dataset | 3 splits | test split |
|---|---|---|
| aeslc | full | test-only |
| arxiv | full | test-only |
| billsum | full | test-only |
| cnn_dailymail | full | test-only |
| gigaword | full | test-only |
| multi_news | full | test-only |
| newsroom | full | test-only |
| pubmed | full | test-only |
| reddit_tifu | full | test-only |
| wikihow | full | test-only |
| xsum | full | test-only |
Each `full` archive includes the following files:

- `test.source`
- `test.target`
- `train.source`
- `train.target`
- `validation.source`
- `validation.target`
Each `test-only` archive includes just:

- `test.source`
- `test.target`
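Since each `.source` file is line-aligned with its `.target` counterpart (one document per line, one summary per line), here is a minimal sketch, not part of this repo, for reading one of these pairs back as `(document, summary)` tuples. The file paths are placeholders for wherever you untarred the archive.

```python
def load_pairs(source_path, target_path):
    """Return aligned (document, summary) line pairs from a .source/.target pair."""
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, encoding="utf-8") as tgt:
        return [(s.rstrip("\n"), t.rstrip("\n")) for s, t in zip(src, tgt)]
```

For example, `load_pairs("xsum/test.source", "xsum/test.target")` would give you the test-split pairs for xsum.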
Datasets that we couldn't figure out:
- `big_patent` - we couldn't build this arrow dataset (see google-research/pegasus#114) - if you can help build it, that would be amazing!

For historical purposes, here is the issue where the process was discussed.
The following notes explain how to build the evaluation datasets from scratch.
Currently the datasets are pulled from either `datasets`, `tfds`, or `tfds_transformed`.
For each dataset you will find a folder with a `process.txt` that includes instructions on how to build it.

The top-level `process-all.py`, which builds most of them at once, will only work once each dataset has been built via its folder's `process.txt`, because many of the datasets require a one-time manual download or other tinkering.
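Before invoking the top-level script, it can help to check which dataset folders have actually produced their archive. The following is a hypothetical pre-flight helper, not part of the repo; the `*.tar.gz` archive naming is an assumption.

```python
from pathlib import Path

def unbuilt_datasets(dataset_dirs, root="."):
    """Return the dataset folder names that don't yet contain a built *.tar.gz archive."""
    return [d for d in dataset_dirs
            if not any(Path(root, d).glob("*.tar.gz"))]
```

Any folder this reports still needs its `process.txt` steps run by hand first.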
Most build scripts use `pegasus`, which takes a bit of tinkering to install:

```
git clone https://github.com/google-research/pegasus
cd pegasus
pip install pygame==2.0.0.dev12
perl -pi -e 's|tensorflow-text==1.15.0rc0|tensorflow-text|; s|tensor2tensor==1.15.0|tensor2tensor|; s|tensorflow-gpu==1.15.2|tensorflow-gpu|' requirements.txt setup.py
pip install -r requirements.txt
pip install -e .
```
Then you will also need:

```
pip install tensorflow_datasets -U
pip install datasets
```
Each sub-folder's `process.txt` contains the command to run the evaluation for that particular dataset. It assumes you have already installed `transformers` with its prerequisites:

```
git clone https://github.com/huggingface/transformers/
cd transformers
pip install -e .[dev]
pip install -r examples/requirements.txt
```
And finally, `cd ./examples/seq2seq`, as that's where the eval scripts are located.

See the README.md inside `examples/seq2seq` for additional information about the eval scripts.
Of course, if you haven't been using the scripts to build the data from scratch, you will need to download and untar the evaluation dataset before you can run the eval.
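The untarring step can be done in one call; this is a minimal sketch, where the archive filename is illustrative (the actual s3 links are in the table above).

```python
import tarfile

def extract_eval_data(archive_path, dest="."):
    """Untar a downloaded evaluation dataset archive into dest."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
```

After extraction you should find the `test.source`/`test.target` files (and the train/validation pairs, for a `full` archive) under the destination directory.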
If you encounter any problems with building the eval data, please create an issue here. If you have any issues with the evaluation outcomes, that is an issue for `transformers`.
If you manage to figure out how to build `big_patent` (see this issue), that would be amazing! Thank you!
This area is a collaboration of sshleifer, patil-suraj and stas00.