This repository contains work on intent detection, specifically abusive intent. The work was done as part of my Master's at Queen's University under Professor Skillicorn. The abusive language detection continues work done by Hannah LeBlanc.
NOTE: Almost all of the datasets were created by others; their sources should all be tagged.
To set up the repo:
- Install Python 3.x (currently <= 3.7 due to TensorFlow support)
- Generate a virtual environment for the project [optional]
- Install Python dependencies with `pip install -r requirements.txt`
- Install the SpaCy model with `python -m spacy download en_core_web_sm`
- Write an accessor for any additional datasets (see `accessors/` for info)
Usage of this work can be broken into several stages: data preparation, initial label generation, model training, model evaluation, and analysis.
To prepare for training and evaluation, several things have to be pre-computed and configured:
- Download and extract the Wikipedia data with the provided script
- Train a fastText model on a local dataset (see GitHub for info) [optional]
  - Place the trained model into `data/lexicons/fast_text/`
- Prepare the datasets for pre-processing by running their individual scripts, or all at once with the combined script
- Execute the pre-processing script
If you are planning on training the intent model:
- Specify your working dataset and fastText model in `config.py`
  - If you do not have a `config.py` file already, start one from a copy of `config_template.py`
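A `config.py` started from `config_template.py` might look like the following. The variable names and values here are illustrative assumptions, not the template's actual contents; check `config_template.py` for the real option names.

```python
# config.py -- hypothetical example; field names are illustrative only.
# See config_template.py for the actual options.

# Name of the working dataset to train/evaluate on (assumed key)
dataset = "twitter_sample"

# Path to the trained fastText model placed under data/lexicons/fast_text/ (assumed key)
fast_text_model = "data/lexicons/fast_text/wiki.en.bin"
```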
- Ensure the hate speech, Kaggle, and insults datasets are pre-processed
- Run the bash script to combine them
- Run the abuse training script
- Add the source dataset to `config.py`
- Run the rough label generation script
- Extract the verbs from the intent frames and compute their embeddings with the collection script
- Refine the rough labels with the refinement script
- Compute the sequence-context matrix with its script
- Train the model with the training script
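To illustrate the label-refinement idea above: verbs extracted from intent frames are embedded, and a document's rough label can then be checked against how "intent-like" its verb's embedding is. The sketch below is my own toy illustration of one way to do this (nearest-centroid comparison with made-up 2-D vectors), not the repo's actual refinement script.

```python
import numpy as np

def refine_label(verb_vec, intent_centroid, benign_centroid):
    """Return 1 (intent) if the verb embedding is closer to the intent centroid,
    else 0 (benign). Centroids are averages of seed-verb embeddings."""
    d_intent = np.linalg.norm(verb_vec - intent_centroid)
    d_benign = np.linalg.norm(verb_vec - benign_centroid)
    return 1 if d_intent < d_benign else 0

# Toy 2-D "embeddings" standing in for real fastText vectors
intent_centroid = np.array([1.0, 0.0])   # e.g. average of "kill", "destroy", ...
benign_centroid = np.array([0.0, 1.0])   # e.g. average of "eat", "walk", ...

print(refine_label(np.array([0.9, 0.2]), intent_centroid, benign_centroid))  # -> 1
```

In the actual pipeline the embeddings come from the trained fastText model and the refinement logic lives in the refinement script; this only shows the shape of the computation.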
Now that you have trained abuse and intent models, predictions can be made for any target dataset of interest.
This is done by specifying the name of the target dataset in the config file and executing the prediction script.
This will make and save a prediction for each document in the targeted corpus to `data/processed_data/[dataset_name]/analysis/intent_abuse/`.
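For downstream tooling, the output location described above can be reconstructed with a small helper. The function name is mine, not part of the repo; only the directory layout comes from the text above.

```python
from pathlib import PurePosixPath

def prediction_dir(dataset_name: str) -> PurePosixPath:
    # Mirrors the documented output layout:
    # data/processed_data/[dataset_name]/analysis/intent_abuse/
    return PurePosixPath("data") / "processed_data" / dataset_name / "analysis" / "intent_abuse"

print(prediction_dir("my_corpus"))  # -> data/processed_data/my_corpus/analysis/intent_abuse
```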
The analysis scripts live under `execution/` and should be named and placed intuitively, corresponding to how they are referred to in the thesis.
Most of the outdated files have been removed, but I'm sure unused functions and files remain here and there; ignore them.