UC Berkeley MIDS Program
Shuo Wang, Ci Song
Spring 2024
Our project is heavily inspired by the Jigsaw Toxic Comment Classification challenge on Kaggle.
In this research project, a range of traditional machine learning models and advanced Natural Language Processing (NLP) transformer models were applied to toxic comment classification. Through experimentation on the Jigsaw Toxic Comment Classification dataset, we found that fine-tuned transformer-based models substantially improved accuracy, precision, recall, F1 score, and ROC AUC over the traditional baselines, with RoBERTa and DistilBERT slightly ahead across nearly all metrics. A model evaluation and metrics summary table is provided at the end of the project. Minimal sketches of a baseline pipeline, a transformer fine-tuning run, and a DAN-style network follow the model lists below.
- Baseline Models
- CountVectorizer - Complement Naive Bayes (CNB)
- CountVectorizer - Multinomial Naive Bayes (MNB)
- TfidfVectorizer - CNB
- TfidfVectorizer - MNB
- Transformer Models
- BART
- BERT
- BERT+CNN
- DistilBERT
- DistilBERT+CNN
- ALBERT
- RoBERTa
- Bidirectional_GRU
- Models
- Deep Averaging Network (DAN)
- DAN-Static
- DAN_Retrain_word2vec
- DAN-Retrain_uniform
- Weighted Averaging Networks (WANs)
- Logistic Regression
- CNN
- CNN-non_Retrain
- CNN-Retrain
- RNN
- RNN-non_Retrain
- RNN-Retrain
- CNN_RNN
- CNN_RNN-non_Retrain
- CNN_RNN-Retrain
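As an illustration of the baseline family, the sketch below pairs a TfidfVectorizer with Multinomial Naive Bayes and reports the metrics used throughout the project. The file name `train.csv`, the column names `comment_text` and `toxic`, and the binary framing are assumptions for the sketch rather than the project's exact setup; swapping `MultinomialNB` for `ComplementNB` gives the CNB variant.

```python
# Baseline sketch (assumed file and column names): TF-IDF features feeding
# a Multinomial Naive Bayes classifier, evaluated with the project's metrics.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

df = pd.read_csv("train.csv")  # Jigsaw training file (assumed path)
X_train, X_test, y_train, y_test = train_test_split(
    df["comment_text"], df["toxic"], test_size=0.2, random_state=42)

model = make_pipeline(TfidfVectorizer(stop_words="english", max_features=50_000),
                      MultinomialNB())
model.fit(X_train, y_train)

pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("F1       :", f1_score(y_test, pred))
print("ROC AUC  :", roc_auc_score(y_test, proba))
```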
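The transformer models were fine-tuned for sequence classification. Below is a minimal sketch using the Hugging Face Trainer API with DistilBERT; the checkpoint name, column names, and hyperparameters are illustrative assumptions, not the exact configuration used in the notebooks.

```python
# Hedged fine-tuning sketch for one transformer model (DistilBERT).
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

df = pd.read_csv("train.csv")[["comment_text", "toxic"]].rename(
    columns={"comment_text": "text", "toxic": "label"})
ds = Dataset.from_pandas(df).train_test_split(test_size=0.2, seed=42)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Pad/truncate every comment to a fixed length (assumed 128 tokens).
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

ds = ds.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="distilbert-toxic",
                         num_train_epochs=2,
                         per_device_train_batch_size=16,
                         evaluation_strategy="epoch")
Trainer(model=model, args=args,
        train_dataset=ds["train"], eval_dataset=ds["test"]).train()
```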
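For the Deep Averaging Network (DAN) family, the core idea is to average the token embeddings and pass the average through fully connected layers. The Keras sketch below shows that architecture with illustrative sizes; the Static vs. Retrain variants in the list correspond to freezing the embedding layer versus initializing it (from word2vec or uniform random weights) and continuing to train it.

```python
# Illustrative DAN sketch: embed tokens, average them, then classify.
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20_000, 128, 200  # assumed sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),   # set trainable=False for the Static variant
    tf.keras.layers.GlobalAveragePooling1D(),           # the "deep averaging" step
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # toxic vs. non-toxic
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="roc_auc")])
model.summary()
```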
- Multi-Class Text Classification: the current data is highly imbalanced, so a potential solution is to build a two-step model (one possible structure is sketched after this list).
- Parallel Computing Technique: utilize parallel computing to run the more advanced NLP models, including RoBERTa-LONG, T5, and XLNET. These three models produced Resource Exhausted errors under the limited GPU capacity available in Google Colab (one possible multi-GPU setup is sketched after this list).
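The two-step idea could look roughly like the sketch below, which is an assumption about one possible structure rather than a finished design: a first model flags toxic comments, and a second model, trained only on the toxic subset, assigns the toxicity type. The placeholder data and Logistic Regression choice are illustrative only.

```python
# Hypothetical two-step sketch for the imbalanced data (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data standing in for the Jigsaw comments.
texts      = ["comment a", "comment b", "comment c", "comment d"]
is_toxic   = [0, 1, 0, 1]            # step-1 labels: toxic vs. non-toxic
toxic_type = ["insult", "threat"]    # step-2 labels, for the toxic subset only

# Step 1: binary toxic / non-toxic classifier on the full, imbalanced data.
step1 = make_pipeline(TfidfVectorizer(),
                      LogisticRegression(class_weight="balanced", max_iter=1000))
step1.fit(texts, is_toxic)

# Step 2: toxicity-type classifier trained only on the toxic comments,
# which removes most of the imbalance from the second problem.
toxic_texts = [t for t, y in zip(texts, is_toxic) if y == 1]
step2 = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
step2.fit(toxic_texts, toxic_type)

# Inference: only comments flagged by step 1 are sent to step 2.
new_texts = ["another comment"]
flagged = step1.predict(new_texts) == 1
to_type = [t for t, f in zip(new_texts, flagged) if f]
types = step2.predict(to_type) if to_type else []
```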
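One possible setup, sketched below as an assumption rather than the project's implementation, is PyTorch's `torch.nn.DataParallel`: the model is replicated on every visible GPU and each batch is split across them, so each GPU only holds the activations for its slice of the batch. The XLNet checkpoint name is illustrative.

```python
# Hedged sketch: multi-GPU data parallelism with torch.nn.DataParallel.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "xlnet-base-cased", num_labels=2)

if torch.cuda.device_count() > 1:
    # Replicate the model on each GPU and split every batch across them.
    model = torch.nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
```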
The notebooks and scripts were run on Google Colaboratory Pro in a GPU environment.
Hugging Face has since updated the Transformers library; to run our code (e.g., the BERT notebooks), revert to an earlier version of the library:
!pip install -q transformers==4.37.2