- Register for BioASQ here: http://participants-area.bioasq.org/accounts/register/
- Tick task A from the list of Tasks at the bottom of the registration page.
- Navigate to the datasets: http://participants-area.bioasq.org/datasets/
- Download Training v.2021 (the txt version), which gives allMeSH_2021.zip.
- Unzip it and place the file in the experiments/bioasq/data folder.
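The unzip step can also be scripted. Below is a minimal sketch using Python's standard zipfile module. The helper name extract_archive is hypothetical and not part of this repository, and the sketch assumes you know where the archive was downloaded.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the data folder if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

For example, calling extract_archive("allMeSH_2021.zip", "experiments/bioasq/data") from the repository root should leave the extracted file in the data folder, assuming the archive sits in the current directory.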
In what follows, we have provided scripts to recreate the dataset.
Each abstract has a unique key, its pmid. We reproduce the splits used in the paper by filtering the downloaded dataset according to the pmids.
For more information on how we created the original splits (e.g. how we obtained the pmids), see the appendix in the paper and the scripts under bin/preprocess.
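The idea of filtering by pmid can be illustrated with a small sketch. This is not the code in construct_dataset.py. It assumes that each split file has a column named pmid and that each article is a dictionary with a pmid field; both are assumptions made for illustration, and the toy records below are invented.

```python
import csv
import io


def read_pmids(csv_text, column="pmid"):
    """Collect the pmids listed in a split file (the column name is an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader}


def filter_by_pmid(articles, pmids):
    """Keep only the articles whose pmid belongs to the given split."""
    return [a for a in articles if str(a["pmid"]) in pmids]


# Toy data standing in for the downloaded dataset and one split file.
articles = [
    {"pmid": "101", "abstractText": "first abstract"},
    {"pmid": "102", "abstractText": "second abstract"},
    {"pmid": "103", "abstractText": "third abstract"},
]
split_csv = "pmid\n101\n103\n"

subset = filter_by_pmid(articles, read_pmids(split_csv))
print([a["pmid"] for a in subset])  # ['101', '103']
```

Storing the pmids in a set keeps each membership test cheap, which matters when the full dataset is scanned once per split. The real dataset file may be too large to hold in memory, so a streaming reader may be needed in practice.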
The script below assumes you have placed allMeSH_2021.json in the experiments/bioasq/data folder.
pip install awscli
mkdir -p data/subsets-v-20000
aws s3 cp s3://sigmoid-bottleneck/bioasq/data/train-100k-part-1.csv --no-sign-request data/subsets-v-20000
aws s3 cp s3://sigmoid-bottleneck/bioasq/data/valid-5k.csv --no-sign-request data/subsets-v-20000
aws s3 cp s3://sigmoid-bottleneck/bioasq/data/test-10k.csv --no-sign-request data/subsets-v-20000
aws s3 cp s3://sigmoid-bottleneck/bioasq/data/vocab.txt --no-sign-request data/subsets-v-20000
python construct_dataset.py --data allMeSH_2021.json
The script should take approximately 5 minutes to run and will create 3 JSON files, so the directory structure should then look like:
.
├── allMeSH_2021.json
├── construct_dataset.py
└── subsets-v-20000
    ├── test-10k.csv
    ├── train-100k-part-1.csv
    ├── valid-5k.csv
    ├── test-10k.json
    ├── train-100k-part-1.json
    ├── valid-5k.json
    └── vocab.txt
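To confirm that the layout matches the listing above, a quick check such as the following may help. It is a minimal sketch: the helper name missing_files is hypothetical, and the file names are copied from the listing.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED = [
    "allMeSH_2021.json",
    "construct_dataset.py",
    "subsets-v-20000/test-10k.csv",
    "subsets-v-20000/train-100k-part-1.csv",
    "subsets-v-20000/valid-5k.csv",
    "subsets-v-20000/test-10k.json",
    "subsets-v-20000/train-100k-part-1.json",
    "subsets-v-20000/valid-5k.json",
    "subsets-v-20000/vocab.txt",
]


def missing_files(root="."):
    """Return the expected files that are absent under root, in listing order."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).is_file()]
```

An empty list from missing_files() means every expected file is present; otherwise the returned names show which step needs to be repeated.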
You can use the scripts run-bsl.sh and run-dft.sh to train the BSL and DFT models, respectively.
Set the SEED environment variable to change the random state.
export SEED=0
mkdir -p logs
./run-bsl.sh
./run-dft.sh
Each experiment is written to the experiment folder.
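To repeat training over several random states, the SEED variable can be set per run from a small driver. The sketch below is one possible way to do this and is not part of the repository: the helper names seed_env and run_seeds are hypothetical, and it assumes the scripts read SEED from the environment exactly as in the commands above.

```python
import os
import subprocess


def seed_env(seed, base_env=None):
    """Return a copy of the environment with SEED set to the given value."""
    env = dict(os.environ if base_env is None else base_env)
    env["SEED"] = str(seed)
    return env


def run_seeds(seeds, scripts=("./run-bsl.sh", "./run-dft.sh"), runner=subprocess.run):
    """Run each training script once per seed; check=True stops on the first failure."""
    for seed in seeds:
        for script in scripts:
            runner([script], env=seed_env(seed), check=True)
```

Calling run_seeds([0, 1, 2]) would launch six runs in sequence. The logs directory still has to exist beforehand, as created by mkdir -p logs in the commands above.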
# Verification via the LP is parallelisable (the larger you can afford to make MLBL_NUM_PROC, the better)
export MLBL_NUM_PROC=10
./eval.sh
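A reasonable value for MLBL_NUM_PROC depends on the machine. As a rough sketch, the number of detected CPU cores can serve as a starting point; the helper name pick_num_proc and the choice to keep one core free are assumptions, not recommendations from the authors.

```python
import os


def pick_num_proc(reserve=1, available=None):
    """Suggest a worker count: the detected cores minus a few kept free, at least 1."""
    total = available if available is not None else (os.cpu_count() or 1)
    return max(1, total - reserve)


# Example: print a suggested export line for the current machine.
print(f"export MLBL_NUM_PROC={pick_num_proc()}")
```

Memory use per process may also limit how many workers are practical, so the suggested value is an upper bound to try rather than a guarantee.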