The Toxic Comment Classification project is a machine learning application that identifies toxic comments. The model is trained on labeled comments from Wikipedia talk pages (the Kaggle Toxic Comment Classification Challenge dataset) and aims to classify toxic comments accurately so moderators can filter out content that violates community guidelines.
.
├── app_exception        # custom exception handling
├── application_logging  # logging
├── data_given           # original input data
├── data                 # raw / processed / transformed data
├── saved_models         # trained classification model
├── report               # model parameter and pipeline reports
├── src                  # source files for project implementation
├── webapp               # ML web application
├── dvc.yaml             # data version control (DVC) pipeline
├── app.py               # Gradio app (see the sketch below)
├── param.yaml           # parameters
├── requirements.txt     # dependencies for the project
└── README.md
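The tree above lists a Gradio app (`app.py`) that serves the trained classifier. The sketch below shows one minimal way such an app could wrap a saved model; the artifact path `saved_models/model.joblib`, the `classify` helper, and a model exposing `predict_proba` with one probability per label are all assumptions for illustration, not the repository's confirmed interface.

```python
# Hypothetical sketch of a Gradio app serving the toxicity classifier.
import gradio as gr
import joblib

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Assumed artifact name; the real saved model may use a different file/format.
model = joblib.load("saved_models/model.joblib")

def classify(comment: str) -> dict:
    # Assumes the saved model returns one probability per toxicity label.
    probs = model.predict_proba([comment])[0]
    return dict(zip(LABELS, map(float, probs)))

# A text box in, a label/confidence panel out.
gr.Interface(fn=classify, inputs="text", outputs="label").launch()
```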
The dataset used in this project is the Toxic Comment Classification Challenge from Kaggle. It contains approximately 159,000 comments from Wikipedia talk pages, each labeled by human annotators for six types of toxicity: toxic, severe toxic, obscene, threat, insult, and identity hate. The dataset is split into a training set and a testing set, with approximately 80% of the comments in the training set and 20% in the testing set.
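The snippet below is a minimal sketch of loading the data and reproducing the 80/20 split. The path `data_given/train.csv`, the `comment_text` column, and the `random_state` value are assumptions based on the standard Kaggle file layout, not confirmed paths in this repository.

```python
# Sketch: load the Kaggle data and split it 80/20, assuming the standard layout.
import pandas as pd
from sklearn.model_selection import train_test_split

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

df = pd.read_csv("data_given/train.csv")   # ~159k Wikipedia talk-page comments (assumed path)
X, y = df["comment_text"], df[LABELS]

# 80% training / 20% testing, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```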
A baseline was created using an RNN model with an embedding layer of size 64. Training with the Adam optimizer (learning rate 0.001) for 10 epochs yielded an accuracy of 83.68% and an ROC-AUC score of 52.03%.
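The following is a minimal sketch of such a baseline, matching the stated embedding size, optimizer, and learning rate. The vocabulary size, sequence length, and LSTM width are illustrative assumptions; the repository's actual architecture may differ.

```python
# Hedged sketch of the RNN baseline: embedding dim 64, Adam(lr=0.001), 10 epochs.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, NUM_LABELS = 20000, 64, 6   # vocab size is an assumption

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),        # 64-dimensional embedding layer
    layers.LSTM(64),                                 # recurrent encoder (width assumed)
    layers.Dense(NUM_LABELS, activation="sigmoid"),  # one sigmoid per toxicity label
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.AUC(name="roc_auc")],
)
# Hypothetical training call on padded token sequences:
# model.fit(X_train_padded, y_train, epochs=10, validation_data=(X_test_padded, y_test))
```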
Contributions to this project are welcome! To contribute, please follow the standard GitHub workflow for pull requests.
If you have any questions or comments about this project, feel free to contact the project maintainer via Gmail.
MIT License