Porting pegasus evaluation datasets

transformers has ported the pegasus project. For usage, please see: Pegasus.

This sub-repo contains links to the evaluation datasets and the build scripts that were used to create them.

Datasets

Datasets that we managed to build successfully, with links to s3:

dataset         3 splits    test split
aeslc           full        test-only
arxiv           full        test-only
billsum         full        test-only
cnn_dailymail   full        test-only
gigaword        full        test-only
multi_news      full        test-only
newsroom        full        test-only
pubmed          full        test-only
reddit_tifu     full        test-only
wikihow         full        test-only
xsum            full        test-only

Each full archive includes the following files:

test.source
test.target
train.source
train.target
validation.source
validation.target

Each test-only archive includes just:

test.source
test.target
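
The .source and .target files are line-aligned: line N of a .source file holds an input document and line N of the matching .target file holds its reference summary, following the convention of the transformers seq2seq examples. A quick, illustrative way to sanity-check an unpacked archive is to confirm that the line counts of each pair match:

# line counts of a source/target pair should be identical
wc -l test.source test.target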

Datasets that we couldn't figure out how to build: big_patent (see the Problems section below).

For historical purposes, here is the issue where the process was discussed.

Building data from scratch

The following notes explain how to build the evaluation datasets from scratch.

Currently the datasets are pulled from either datasets, tfds, or tfds_transformed.

For each dataset you will find a folder with process.txt that includes instructions on how to build it.

The top-level process-all.py script, which builds most of them at once, will only work after each dataset has first been built via its folder's process.txt, because many of the datasets require a one-time manual download or other tinkering.
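
As a rough sketch of the workflow (folder names other than process-all.py are only illustrative; the authoritative steps live in each folder's process.txt):

# build one dataset first, following the manual steps in its process.txt
cd xsum
cat process.txt        # perform the one-time manual download/tinkering it describes
cd ..
# once every dataset has been built at least once, the rest can be rebuilt in one go
python process-all.py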

Most build scripts use pegasus, which takes a bit of tinkering to install:

git clone https://github.com/google-research/pegasus
cd pegasus
# pre-install a specific pygame pre-release build
pip install pygame==2.0.0.dev12
# unpin the outdated tensorflow-text, tensor2tensor and tensorflow-gpu versions in requirements.txt and setup.py
perl -pi -e 's|tensorflow-text==1.15.0rc0|tensorflow-text|; s|tensor2tensor==1.15.0|tensor2tensor|; s|tensorflow-gpu==1.15.2|tensorflow-gpu|' requirements.txt setup.py
pip install -r requirements.txt
pip install -e .

Then you will also need:

pip install tensorflow_datasets -U
pip install datasets
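
As a quick sanity check (not part of the original instructions, and assuming the editable pegasus install exposes a pegasus package), you can verify that the main build dependencies import cleanly:

python -c "import pegasus, tensorflow_datasets, datasets; print('ok')"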

Evaluation

Each sub-folder's process.txt contains the command to run the evaluation for that particular dataset.

It assumes you have already installed transformers with its prerequisites:

git clone https://github.com/huggingface/transformers/
cd transformers
pip install -e .[dev]
pip install -r examples/requirements.txt    
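
A simple way to confirm the editable install is picked up (just a sanity check, not from the original instructions):

python -c "import transformers; print(transformers.__version__)"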

And finally:

cd ./examples/seq2seq

as that's where the eval scripts are located.

See README.md inside examples/seq2seq for additional information about the eval scripts.

Of course, if you haven't built the data from scratch with the scripts above, you will need to download and untar the evaluation dataset before you can run the eval.
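
For example, something along these lines, where the placeholder URL stands for the relevant s3 link from the table above and the archive name is only an illustration:

# download a test-only archive (replace the placeholder with the s3 link from the table)
wget <s3-url-from-the-table> -O xsum-test.tgz
# unpack it; this should yield test.source and test.target
tar -xzvf xsum-test.tgz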

Problems

If you encounter any problems with building the eval data, please create an issue here. If you have a problem with the evaluation results themselves, that is an issue for transformers.

If you manage to figure out how to build big_patent (see this issue), that would be amazing! Thank you!

Authors

This area is a collaboration of sshleifer, patil-suraj and stas00.