
XL-WSD

Code for the paper XL-WSD: An Extra-Large and Cross-Lingual Evaluation Framework for Word Sense Disambiguation. Please visit https://sapienzanlp.github.io/xl-wsd/ for more info and to download the data. Pretrained models are available at the bottom of this page.

Install

First, set up the Python environment. Make sure Anaconda is already installed.

git clone https://github.com/SapienzaNLP/xl-wsd-code.git
conda create --name xl-wsd-code python=3.7
conda activate xl-wsd-code
cd xl-wsd-code
pip install -r requirements.txt
conda install pytorch==1.5.0 torchtext==0.6.0 cudatoolkit=10.1 -c pytorch
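
As a quick, optional sanity check that the installation succeeded and the GPU is visible (this command is not part of the original instructions):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"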

Then, download and install WordNet.

cd /tmp
wget http://wordnetcode.princeton.edu/3.0/WordNet-3.0.tar.gz
tar xvzf WordNet-3.0.tar.gz
sudo mv WordNet-3.0 /opt/
rm WordNet-3.0.tar.gz

If you do not want to move WordNet to the /opt/ directory, set the WORDNET_PATH variable in src/datasets/__init__.py to the full path of your WordNet-3.0 directory.
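
For example, if WordNet was extracted into your home directory instead, the variable would be set as follows (the path is illustrative):

WORDNET_PATH = "/home/username/WordNet-3.0"  # full path to the WordNet-3.0 directory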

Train

Setup data

wget --header="Host: doc-04-b8-docs.googleusercontent.com" --header="User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-GB,en-US;q=0.9,en;q=0.8" --header="Referer: https://drive.google.com/" --header="Cookie: AUTH_bjalsfn9vp89mfmro6spe8un3che13a6_nonce=nlgjg7mf6jn9e" --header="Connection: keep-alive" "https://doc-04-b8-docs.googleusercontent.com/docs/securesc/qpect75hpbjc0ojmotm96i6g1v6ev8i1/ssjeem6krjiq45h1lb2t3k0t4uh11fea/1617902700000/13518213284567006193/13518213284567006193/19YTL-Uq95hjiFZfgwEpXRgcYGCR_PQY0?e=download&authuser=1&nonce=nlgjg7mf6jn9e&user=13518213284567006193&hash=j6hh86p5arl35lijpmf4oak2hnhvrr20" -c -O 'xl-wsd-data.tar.gz'
tar xvzf xl-wsd-data.tar.gz

If wget does not work, download the data manually from https://drive.google.com/file/d/19YTL-Uq95hjiFZfgwEpXRgcYGCR_PQY0/view?usp=sharing
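
Alternatively, the gdown utility can often fetch Google Drive files from the command line; a sketch, assuming the file ID from the link above (gdown is not among this repo's requirements and must be installed separately):

pip install gdown
gdown "https://drive.google.com/uc?id=19YTL-Uq95hjiFZfgwEpXRgcYGCR_PQY0" -O xl-wsd-data.tar.gz
tar xvzf xl-wsd-data.tar.gz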

Setup config

Open config/config_en_semcor_wngt.train.yaml with your favourite text editor and edit the paths for inventory_dir, test_data_root and train_data_root. train_data_root is a dictionary from a language code, e.g., en, to a list of paths; each path can be a dataset in the standard format of this framework. test_names is also a dictionary from a language code to a list of names of test set folders that can be found in test_data_root.
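
For instance, the data-path fields might look like this (the directory names are illustrative; check the paths in your extracted data folder):

inventory_dir: xl-wsd-dataset/inventories
train_data_root:
    en:
        - xl-wsd-dataset/training_datasets/semcor_en
        - xl-wsd-dataset/training_datasets/wngt_en
test_data_root: xl-wsd-dataset/evaluation_datasets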

For example, given the following directories in the evaluation_datasets dir,

ls xl-wsd-dataset/evaluation_datasets
test-en
test-en-coarse
dev-en
...
test-zh
dev-zh

one can set the test_names variable in the config as:

en:
    - test-en
    - test-en-coarse
it:
    - test-it
zh:
    - test-zh

Evaluation on each test set is performed at the end of each training epoch. The dev_name variable in the config, instead, is a pair (language, dev set name). It can be set as follows:

dev_name:
    - en
    - dev-en

outpath is the path to a directory where a new folder for the newly trained model will be created; the checkpoints and information about the model will be stored there.

encoder_name may be any transformer model supported by AllenNLP 1.0.

model_name may be any name you would like to give to the model.

All the other options can also be changed and are fairly self-explanatory.

Running the training

cd xl-wsd-code
PYTHONPATH=. python src/training/wsd_trainer.py --config config/config_en_semcor_wngt.train.yaml

The wsd_trainer.py script also accepts the following parameters:

--dryrun | --no-wandb-log, which disables logging to wandb.

--no_checkpoint, which disables the saving of checkpoints.

--reload_checkpoint, which allows the program to reload weights that were previously saved in the same directory, i.e., outpath/checkpoints.

--cpu, to run the script on CPU.

Some other parameters are also allowed and, if specified, overwrite those in the config (see the example after this list):

--weight_decay sets the weight decay.

--learning_rate sets the learning rate.

--gradient_clipping sets the gradient clipping threshold.
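
For instance, a run that overrides the optimization hyperparameters from the command line might look like this (the values are illustrative):

PYTHONPATH=. python src/training/wsd_trainer.py --config config/config_en_semcor_wngt.train.yaml --learning_rate 1e-5 --weight_decay 0.01 --gradient_clipping 1.0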

During training, the folder checkpoints is created within outpath and checkpoints are saved therein.

At the end of training the best model found (with the lowest loss on the dev) is reloaded and tested on the test set defined in test_names and in outpath/evaluation/ a file for each test set is created with the predictions. The files contain a row for each test-set instance with the id and the predicted synset separated by space.
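
Given that row format, a prediction file can be scored against a gold key file with a few lines of Python. This is a minimal sketch, assuming gold rows follow the common "id synset [synset ...]" layout; the file paths are placeholders, not files shipped with this repo:

# score a prediction file ("<id> <synset>" per row) against a gold key
# file ("<id> <synset> [<synset> ...]" per row); paths are illustrative
def load_keys(path):
    keys = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            keys[parts[0]] = set(parts[1:])
    return keys

gold = load_keys("test-en/test-en.gold.key.txt")
pred = load_keys("outpath/evaluation/test-en.predictions.txt")
# a prediction is correct if it matches any of the gold synsets;
# gold instances missing from the predictions count as wrong
correct = sum(1 for iid, p in pred.items() if p & gold.get(iid, set()))
print(f"accuracy: {correct / len(gold):.3f}")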

Evaluate

To evaluate a saved model, it is enough to run the following command

PYTHONPATH=. python src/evaluation/evaluate_model.py --config config/config_en_semcor_wngt.test.yaml

where config_en_semcor_wngt.test.yaml is a configuration file similar to config_en_semcor_wngt.train.yaml, with all the test sets one is interested in specified in the test_names field. The best.th set of weights within outpath/checkpoints/ will be evaluated and the results for each dataset printed to the console. The predictions will be saved in outpath/evaluation. evaluate_model.py also takes the following parameters which, if specified, override those in the config file:

--checkpoint_path, containing the path to a specific checkpoint.

--output_path, to specify a different path where the predictions are stored.

--pos, a list containing all the POS tags on which one wants to perform a separate evaluation. The POS tags allowed are {n,v,r,a}, where n = noun, v = verb, r = adverb and a = adjective.

--verbose, to make the script print more info.

--debug, to print debug files.

--cpu, to run the evaluation on CPU rather than on GPU.

For example,

PYTHONPATH=. python src/evaluation/evaluate_model.py --config config/config_en_semcor_wngt.test.yaml --pos n

would print the results on the test sets computed on nominal instances only.

PYTHONPATH=. python src/evaluation/evaluate_model.py --config config/config_en_semcor_wngt.test.yaml --pos n v

would instead print results computed separately on nouns and verbs.

Pretrained Models

| Encoder | Training Data | Link |
| --- | --- | --- |
| XLM-Roberta Large | SemCor+WNGT | link |
| XLM-Roberta Base | SemCor+WNGT | link |
| Multilingual BERT | SemCor+WNGT | link |
| ALL | SemCor+WNGT | link |

License

This project is released under the CC-BY-NC 4.0 license (see LICENSE). If you use the code in this repo, please link to it.

Acknowledgements

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union's Horizon 2020 research and innovation programme.

The authors gratefully acknowledge the support of the ERC Consolidator Grant FoTran No. 771113 under the European Union's Horizon 2020 research and innovation programme.

The authors gratefully acknowledge the support of the ELEXIS project No. 731015 under the European Union's Horizon 2020 research and innovation programme.

The authors also thank the CSC - IT Center for Science (Finland) for the computational resources.