This repository contains work on intent detection, specifically abusive intent. The work was done as part of my Master's at Queen's University under Professor Skillicorn. The abusive language detection continues work done by Hannah LeBlanc.
NOTE: Almost all of the datasets were created by others; their sources should all be tagged.
To set up the repo:
- Install Python 3.x (currently <= 3.7 due to TensorFlow support)
- Generate a virtual environment for the project [optional]
- Install Python dependencies with `pip install -r requirements.txt`
- Install the SpaCy model with `python -m spacy download en_core_web_sm`
- Write an accessor for any additional datasets (see `accessors/` for info)
Usage of this work can be broken into several stages: data preparation, initial label generation, model training, model evaluation, and analysis.
To prepare for training and evaluation, several things have to be pre-computed and configured:
- Download and extract the Wikipedia data with the provided script
- Train a fastText model on a local dataset (see GitHub for info) [optional]
  - Place the trained model into `data/lexicons/fast_text/`
- Prepare the datasets for pre-processing by running their individual scripts, or all at once with the combined script
- Execute the pre-processing script
If you are planning on training the intent model:
- Specify your working dataset and fastText model in `config.py`
  - If you do not have a `config.py` file already, start one from a copy of `config_template.py`
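A `config.py` started from `config_template.py` might look like the following. The variable names and values here are illustrative assumptions, not the template's actual contents; check `config_template.py` for the real option names.

```python
# config.py -- hypothetical example; field names are illustrative only.
# See config_template.py for the actual options.

# Name of the working dataset to train/evaluate on (assumed key)
dataset = "twitter_sample"

# Path to the trained fastText model placed under data/lexicons/fast_text/ (assumed key)
fast_text_model = "data/lexicons/fast_text/wiki.en.bin"
```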
- Ensure the hate speech, Kaggle, and insults datasets are pre-processed
- Run the bash script to combine them
- Run the abuse training script
- Add the source dataset to `config.py`
- Run the rough label generation script
- Extract the verbs from the intent frames and compute their embeddings with the collection script
- Refine the rough labels with the refinement script
- Compute the sequence-context matrix with its script
- Train the model with the training script
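To illustrate the label-refinement idea above: verbs extracted from intent frames are embedded, and a document's rough label can then be checked against how "intent-like" its verb's embedding is. The sketch below is my own toy illustration of one way to do this (nearest-centroid comparison with made-up 2-D vectors), not the repo's actual refinement script.

```python
import numpy as np

def refine_label(verb_vec, intent_centroid, benign_centroid):
    """Return 1 (intent) if the verb embedding is closer to the intent centroid,
    else 0 (benign). Centroids are averages of seed-verb embeddings."""
    d_intent = np.linalg.norm(verb_vec - intent_centroid)
    d_benign = np.linalg.norm(verb_vec - benign_centroid)
    return 1 if d_intent < d_benign else 0

# Toy 2-D "embeddings" standing in for real fastText vectors
intent_centroid = np.array([1.0, 0.0])   # e.g. average of "kill", "destroy", ...
benign_centroid = np.array([0.0, 1.0])   # e.g. average of "eat", "walk", ...

print(refine_label(np.array([0.9, 0.2]), intent_centroid, benign_centroid))  # -> 1
```

In the actual pipeline the embeddings come from the trained fastText model and the refinement logic lives in the refinement script; this only shows the shape of the computation.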
Now that you have trained abuse and intent models, predictions can be made for any target dataset of interest.
This is done by specifying the name of the target dataset in the config file and executing the prediction script.
This will make and save a prediction for each document in the targeted corpus to `data/processed_data/[dataset_name]/analysis/intent_abuse/`.
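For downstream tooling, the output location described above can be reconstructed with a small helper. The function name is mine, not part of the repo; only the directory layout comes from the text above.

```python
from pathlib import PurePosixPath

def prediction_dir(dataset_name: str) -> PurePosixPath:
    # Mirrors the documented output layout:
    # data/processed_data/[dataset_name]/analysis/intent_abuse/
    return PurePosixPath("data") / "processed_data" / dataset_name / "analysis" / "intent_abuse"

print(prediction_dir("my_corpus"))  # -> data/processed_data/my_corpus/analysis/intent_abuse
```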
The analysis scripts live under `execution/` and should be named and placed intuitively, corresponding to how they are referred to in the thesis.
Most of the outdated files have been removed, but I'm sure unused functions and files remain here and there; ignore them.