Wonhee Jung ([email protected])
Kevin Mackie ([email protected])
Cindy Tseng ([email protected])
Conrad Harley ([email protected])
Online platforms allow people to express their opinions freely and stimulate collaboration across the globe. Unfortunately, online interaction often comes with loosened inhibitions about making profane, bigoted, or offensive remarks. We refer to such unwelcome remarks as "toxic chat".
Online systems may or may not have their own embedded profanity filtering, and those that do typically use pre-registered terms and simple pattern matching. This approach lacks the deeper contextual understanding needed to identify sentences that are toxic but that may not contain banned terms.
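To illustrate the limitation described above, the following is a minimal sketch of a term-based filter. The banned-term list and the function name `contains_banned_term` are hypothetical and chosen only for illustration; real filters of this kind vary in their details.

```python
import re

# Hypothetical banned-term list, for illustration only.
BANNED_TERMS = {"idiot", "moron"}

def contains_banned_term(message: str) -> bool:
    """Return True if any whole word in the message is on the banned list."""
    words = re.findall(r"[a-z']+", message.lower())
    return any(word in BANNED_TERMS for word in words)

# A message containing a registered term is caught.
flagged = contains_banned_term("You are an idiot")

# A hostile message with no registered term passes through unnoticed.
missed = contains_banned_term("Nobody would miss you if you left")
```

The second message is arguably toxic, yet the filter has no way to notice it, because no individual word appears on the list.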
Thus we propose a new toxic chat filtering system that differentiates itself in that a) its filtering is based on machine learning and deeper contextual analysis, and b) it is deployed as a scalable and easily integrated web framework that can be adapted to any source of text for online interaction of any size.
The platform will be based on Docker and Kubernetes, for easy deployment and dependency management and for fast scale-out to large systems. It will use distributed systems technology for processing and storage, allowing rapid scaling while maintaining a shared file space (HDFS) across Kubernetes zones.
The framework will be documented in a final report that presents the architecture, development, and use of this system, with a web chat application and a Twitch chatbot as motivating examples.
We will create a prototype PaaS/SaaS service that provides the following specific capabilities:
- machine-learning-based toxic chat identification and filtering engine
- integrated web chat application or chatbot that uses the engine to analyze a real-time stream of text
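As a rough sketch of what the core of such an engine might look like, the example below trains a small text classifier with scikit-learn. The choice of TF-IDF features with logistic regression, the toy training sentences, and the helper name `toxicity_score` are all assumptions made for illustration; the proposal does not commit to a particular model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration; a real model would be
# trained on a large labeled corpus.
train_texts = [
    "thanks for the great stream",
    "good game everyone",
    "see you all tomorrow",
    "you are a worthless idiot",
    "shut up you stupid fool",
    "nobody likes you, get lost",
]
train_labels = [0, 0, 0, 1, 1, 1]  # 1 = toxic, 0 = acceptable

# TF-IDF turns each message into a weighted bag of words; logistic
# regression then learns a weight per word.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

def toxicity_score(message: str) -> float:
    """Return the model's estimated probability that the message is toxic."""
    # predict_proba columns follow model.classes_, here [0, 1].
    return float(model.predict_proba([message])[0][1])

toxic_score = toxicity_score("you stupid idiot")
benign_score = toxicity_score("thanks for the great game")
```

A chat application or chatbot could call `toxicity_score` on each incoming message and hide or flag messages whose score exceeds a chosen threshold.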
The only publicly available toxic comment dataset we have been able to find so far is the one from Kaggle's Toxic Comment Classification Challenge, which we will use to train our classifier: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
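The sketch below shows one way the training file could be read into texts and binary labels. It assumes the column layout of the challenge's `train.csv` as we understand it (a `comment_text` column plus six 0/1 label columns); the helper name `load_comments` and the two sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Label columns assumed to be present in the challenge's train.csv.
LABEL_COLUMNS = [
    "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate",
]

def load_comments(handle):
    """Read CSV rows and return (texts, is_toxic).

    is_toxic is 1 when any of the label columns is set for that row.
    """
    texts, is_toxic = [], []
    for row in csv.DictReader(handle):
        texts.append(row["comment_text"])
        is_toxic.append(int(any(row[col] == "1" for col in LABEL_COLUMNS)))
    return texts, is_toxic

# Invented sample rows standing in for the real file.
SAMPLE = io.StringIO(
    "id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    "a1,Thanks for the help,0,0,0,0,0,0\n"
    "a2,You are a fool,1,0,0,0,1,0\n"
)
texts, is_toxic = load_comments(SAMPLE)
```

With the real file, the same function could be called as `load_comments(open("train.csv", newline="", encoding="utf-8"))`. Collapsing the six labels into one binary flag is a simplification; a multi-label classifier could keep them separate.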
The team will integrate the following technologies and solutions to provide a general framework that can scale to high volume/traffic in the future.
- Docker - to deploy the engine and application, with all of their dependencies, to multiple instances
- Kubernetes - to make deployment easy and allow for fast scale out
- CI/CD pipeline - to automate the build, integration, and deployment
- AWS, GCP, Heroku, etc. - to deploy the solution onto a mainstream PaaS infrastructure
- HDFS or similar - to store big data and share it between systems
- Scikit-learn or Apache Spark + MLlib - to train and deploy the classifier for toxic chat detection, scaling out the processing with Spark if scikit-learn is not sufficient for the expected volume of traffic
- RESTful APIs - to (possibly) provide access to the filtering engine for use by other applications
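As one possible shape for the RESTful access mentioned in the list above, the sketch below exposes a single endpoint using only the Python standard library. The `/classify` path, the JSON field names, the port, and the placeholder `score_message` function are all hypothetical; in the real system the placeholder would be replaced by the trained classifier, and a production web framework would likely replace `http.server`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def score_message(text: str) -> float:
    """Placeholder scorer (hypothetical); a trained classifier would go here."""
    return 1.0 if "idiot" in text.lower() else 0.0

def classify_payload(body: bytes, threshold: float = 0.5) -> dict:
    """Turn a JSON request body into a JSON-serializable verdict."""
    text = json.loads(body)["text"]
    score = score_message(text)
    return {"score": score, "toxic": score >= threshold}

class FilterHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/classify":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = classify_payload(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected JSON body with a 'text' field")
            return
        payload = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port: int = 8080) -> None:
    """Start the filtering service and block until interrupted."""
    HTTPServer(("0.0.0.0", port), FilterHandler).serve_forever()
```

Calling `serve()` starts the service; a client would then POST a body such as `{"text": "some chat line"}` to `/classify`. Keeping the scoring logic in `classify_payload`, separate from the HTTP handler, means the same function can be reused by a chatbot that does not go through HTTP.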