UC Berkeley MIDS Program
Shuo Wang, Ci Song
Spring 2024
Our project is heavily inspired by the Jigsaw Toxic Comment Classification challenge on Kaggle.
In this research project, a range of traditional machine learning models and advanced Natural Language Processing (NLP) transformer models were applied to toxic comment classification. Through experimentation on the Jigsaw Toxic Comment Classification dataset, we found that fine-tuned transformer-based models substantially improved accuracy, precision, recall, F1 score, and ROC AUC over the traditional baselines, with RoBERTa and DistilBERT slightly ahead across nearly all metrics. A model evaluation and metrics summary table is provided at the end of the project. Minimal sketches of a baseline pipeline, a transformer fine-tuning run, and a DAN-style network follow the model lists below.
- Baseline Models
- CountVectorizer - Complement Naive Bayes (CNB)
- CountVectorizer - Multinomial Naive Bayes (MNB)
- TfidfVectorizer - CNB
- TfidfVectorizer - MNB
- Transformer Models
- BART
- BERT
- BERT+CNN
- DistilBERT
- DistilBERT+CNN
- ALBERT
- RoBERTa
- Bidirectional_GRU
- Models
- Deep Averaging Network (DAN)
- DAN-Static
- DAN_Retrain_word2vec
- DAN-Retrain_uniform
- Weighted Averaging Networks (WANs)
- Logistic Regression
- CNN
- CNN-non_Retrain
- CNN-Retrain
- RNN
- RNN-non_Retrain
- RNN-Retrain
- CNN_RNN
- CNN_RNN-non_Retrain
- CNN_RNN-Retrain
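As an illustration of the baseline family, the sketch below pairs a TfidfVectorizer with Multinomial Naive Bayes and reports the metrics used throughout the project. The file name `train.csv`, the column names `comment_text` and `toxic`, and the binary framing are assumptions for the sketch rather than the project's exact setup; swapping `MultinomialNB` for `ComplementNB` gives the CNB variant.

```python
# Baseline sketch (assumed file and column names): TF-IDF features feeding
# a Multinomial Naive Bayes classifier, evaluated with the project's metrics.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

df = pd.read_csv("train.csv")  # Jigsaw training file (assumed path)
X_train, X_test, y_train, y_test = train_test_split(
    df["comment_text"], df["toxic"], test_size=0.2, random_state=42)

model = make_pipeline(TfidfVectorizer(stop_words="english", max_features=50_000),
                      MultinomialNB())
model.fit(X_train, y_train)

pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("F1       :", f1_score(y_test, pred))
print("ROC AUC  :", roc_auc_score(y_test, proba))
```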
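The transformer models were fine-tuned for sequence classification. Below is a minimal sketch using the Hugging Face Trainer API with DistilBERT; the checkpoint name, column names, and hyperparameters are illustrative assumptions, not the exact configuration used in the notebooks.

```python
# Hedged fine-tuning sketch for one transformer model (DistilBERT).
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

df = pd.read_csv("train.csv")[["comment_text", "toxic"]].rename(
    columns={"comment_text": "text", "toxic": "label"})
ds = Dataset.from_pandas(df).train_test_split(test_size=0.2, seed=42)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Pad/truncate every comment to a fixed length (assumed 128 tokens).
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

ds = ds.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="distilbert-toxic",
                         num_train_epochs=2,
                         per_device_train_batch_size=16,
                         evaluation_strategy="epoch")
Trainer(model=model, args=args,
        train_dataset=ds["train"], eval_dataset=ds["test"]).train()
```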
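For the Deep Averaging Network (DAN) family, the core idea is to average the token embeddings and pass the average through fully connected layers. The Keras sketch below shows that architecture with illustrative sizes; the Static vs. Retrain variants in the list correspond to freezing the embedding layer versus initializing it (from word2vec or uniform random weights) and continuing to train it.

```python
# Illustrative DAN sketch: embed tokens, average them, then classify.
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20_000, 128, 200  # assumed sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),   # set trainable=False for the Static variant
    tf.keras.layers.GlobalAveragePooling1D(),           # the "deep averaging" step
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # toxic vs. non-toxic
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="roc_auc")])
model.summary()
```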
- Multi-Class Text Classification: the current data is highly imbalanced, so a potential solution is to build a two-step model (one possible structure is sketched after this list).
- Parallel Computing Technique: utilize parallel computing to run the more advanced NLP models, including RoBERTa-LONG, T5, and XLNET. These three models produced Resource Exhausted errors under the limited GPU capacity available in Google Colab (one possible multi-GPU setup is sketched after this list).
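The two-step idea could look roughly like the sketch below, which is an assumption about one possible structure rather than a finished design: a first model flags toxic comments, and a second model, trained only on the toxic subset, assigns the toxicity type. The placeholder data and Logistic Regression choice are illustrative only.

```python
# Hypothetical two-step sketch for the imbalanced data (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data standing in for the Jigsaw comments.
texts      = ["comment a", "comment b", "comment c", "comment d"]
is_toxic   = [0, 1, 0, 1]            # step-1 labels: toxic vs. non-toxic
toxic_type = ["insult", "threat"]    # step-2 labels, for the toxic subset only

# Step 1: binary toxic / non-toxic classifier on the full, imbalanced data.
step1 = make_pipeline(TfidfVectorizer(),
                      LogisticRegression(class_weight="balanced", max_iter=1000))
step1.fit(texts, is_toxic)

# Step 2: toxicity-type classifier trained only on the toxic comments,
# which removes most of the imbalance from the second problem.
toxic_texts = [t for t, y in zip(texts, is_toxic) if y == 1]
step2 = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
step2.fit(toxic_texts, toxic_type)

# Inference: only comments flagged by step 1 are sent to step 2.
new_texts = ["another comment"]
flagged = step1.predict(new_texts) == 1
to_type = [t for t, f in zip(new_texts, flagged) if f]
types = step2.predict(to_type) if to_type else []
```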
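One possible setup, sketched below as an assumption rather than the project's implementation, is PyTorch's `torch.nn.DataParallel`: the model is replicated on every visible GPU and each batch is split across them, so each GPU only holds the activations for its slice of the batch. The XLNet checkpoint name is illustrative.

```python
# Hedged sketch: multi-GPU data parallelism with torch.nn.DataParallel.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "xlnet-base-cased", num_labels=2)

if torch.cuda.device_count() > 1:
    # Replicate the model on each GPU and split every batch across them.
    model = torch.nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
```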
The notebooks and scripts were run on Google Colaboratory Pro in a GPU environment.
Hugging Face has since updated the Transformers library; to run our code (e.g., the BERT notebooks), revert to an earlier version of the library:
!pip install -q transformers==4.37.2