This code is a re-implementation of the paper "Metric Learning for Dynamic Text Classification" by ASAPP Research, using the Catalyst framework.
The original code for the paper is available at asappresearch/dynamic-classification.
-
Clone repository
git clone git@github.com:xelibrion/catalyst-dynamic-text-classification.git
cd catalyst-dynamic-text-classification
-
Install dependencies
pip install -e .
-
Fetch data
cd dynamic_class
./get_data.py
-
Run the training script once to build the vocabulary (it will then fail to train the model, because the embeddings do not exist yet)
./train.py
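For orientation, the vocabulary step can be sketched roughly as below. This is only an illustration, not the code in train.py: the names `build_vocab` and `format_vocab`, the `min_count` parameter, and the "word count" line layout are all assumptions (the layout is a guess based on the later step that reads the first column of input/vocab.txt).

```python
from collections import Counter


def build_vocab(texts, min_count=1):
    """Count whitespace-separated tokens and keep the frequent ones.

    Hypothetical helper: the real pipeline may count and filter differently.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.split())  # plain whitespace tokenization
    # most_common() orders by descending frequency
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


def format_vocab(vocab):
    """Render one 'word count' pair per line (assumed layout)."""
    return "\n".join(f"{word} {n}" for word, n in vocab)


if __name__ == "__main__":
    print(format_vocab(build_vocab(["a b a", "b a c"], min_count=2)))
```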
-
Compute word vectors for the vocabulary using a fastText model. A pretrained model can be downloaded here.
cat input/vocab.txt | awk -F ' ' '{print $1}' > vocab_words.txt
~/projects/fasttext/fasttext print-word-vectors ~/projects/fasttext/cc.en.300.bin < vocab_words.txt > vocab_vectors.txt
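The resulting vocab_vectors.txt has one word per line followed by its vector components, separated by spaces. To sanity-check the file before training, something like the sketch below may help. The helper name `load_word_vectors` is hypothetical and not part of this repository; how train.py actually reads the file may differ.

```python
def load_word_vectors(lines):
    """Parse lines of the form 'word v1 v2 ... vN'.

    Returns (vectors, dim), where vectors maps each word to a list of floats.
    Hypothetical helper for inspection only.
    """
    vectors = {}
    dim = None
    for line in lines:
        parts = line.split()  # also drops the trailing space and newline
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        values = [float(x) for x in parts[1:]]
        if dim is None:
            dim = len(values)
        elif len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return vectors, dim


if __name__ == "__main__":
    sample = ["hello 0.1 0.2 0.3 \n", "world 0.4 0.5 0.6 \n"]
    vectors, dim = load_word_vectors(sample)
    print(len(vectors), dim)
    # For the real file:
    # with open("vocab_vectors.txt", encoding="utf-8") as f:
    #     vectors, dim = load_word_vectors(f)
```

With cc.en.300.bin the dimension should come out as 300, and the number of entries should match the number of lines in vocab_words.txt.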
Please note that the original paper used GloVe as word embeddings. You might want to experiment with the choice of embeddings.
Also, the tokenizer could be much better - at the moment it simply splits on whitespace.
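As one possible starting point for a better tokenizer, the sketch below separates punctuation from words and lowercases the text. It is not what the pipeline currently does; the function name `tokenize`, the regular expression, and the choice to lowercase are all assumptions made for illustration.

```python
import re

# A word (optionally with an apostrophe suffix such as "don't"),
# or any single character that is neither a word character nor whitespace.
_TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split text into lowercase word and punctuation tokens (hypothetical helper)."""
    return _TOKEN_RE.findall(text.lower())


if __name__ == "__main__":
    print(tokenize("Hello, world! Don't stop."))
    # Whitespace splitting would instead yield: ['Hello,', 'world!', "Don't", 'stop.']
```

Changing the tokenizer changes which words end up in the vocabulary, so the vocabulary and word vectors would presumably need to be rebuilt afterwards.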
-
Train the model
./train.py
This pipeline uses the sru package, which can be difficult to install and get running. See my comment here.