
Final Project: Text Classification on Toxic Comments

UC Berkeley MIDS Program

Shuo Wang, Ci Song

Spring 2024

Our project is heavily inspired by the Jigsaw Toxic Comment Classification challenge on Kaggle.

Project Overview

In this research project, a range of traditional machine learning models and advanced Natural Language Processing (NLP) transformer models were applied to toxic comment classification. Experiments on the Jigsaw Toxic Comment Classification dataset showed that fine-tuned transformer-based models substantially improved accuracy, precision, recall, F1 score, and ROC AUC over the baselines, with RoBERTa and DistilBERT slightly ahead on nearly all metrics. A model evaluation and metrics summary table appears at the end of this README.
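For reference, the metrics listed above can be computed with scikit-learn. The sketch below is a minimal illustration only; the y_true labels and y_prob probabilities are hypothetical toy values, not project outputs.

# Minimal sketch of the reported metrics, using hypothetical toy labels
# (y_true) and predicted probabilities (y_prob), not project outputs.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = np.array([0, 1, 1, 0, 1])           # toy ground-truth labels
y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.9]) # toy predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)         # hard labels from a 0.5 threshold

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))  # AUC scores the probabilities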

Dataset

[Figure: dataset overview]

Models

  • Baseline Models (see the sketch after this list)
    • CountVectorizer - Complement Naive Bayes (CNB)
    • CountVectorizer - Multinomial Naive Bayes (MNB)
    • TfidfVectorizer - CNB
    • TfidfVectorizer - MNB
  • Transformer Models
    • BART
    • BERT
    • BERT+CNN
    • DistilBERT
    • DistilBERT+CNN
    • ALBERT
    • RoBERTa
    • Bidirectional_GRU
  • Additional Models
    • Deep Averaging Network (DAN)
      • DAN-Static
      • DAN_Retrain_word2vec
      • DAN-Retrain_uniform
    • Weighted Averaging Networks (WANs)
    • Logistic Regression
    • CNN
      • CNN-non_Retrain
      • CNN-Retrain
    • RNN
      • RNN-non_Retrain
      • RNN-Retrain
    • CNN_RNN
      • CNN_RNN-non_Retrain
      • CNN_RNN-Retrain
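As a concrete illustration of the baseline family above, each of the four baselines pairs a bag-of-words vectorizer with a Naive Bayes variant. Below is a minimal scikit-learn sketch of one pairing (TfidfVectorizer with Complement Naive Bayes); the toy comments and labels are hypothetical, and swapping in CountVectorizer and/or MultinomialNB yields the other three baselines.

# Minimal sketch of the TfidfVectorizer + Complement Naive Bayes baseline.
# Swapping in CountVectorizer and/or MultinomialNB gives the other three
# baseline pairings. The toy comments and labels below are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

comments = ["have a wonderful day", "you are an idiot",
            "thanks for the detailed answer", "shut up, loser"]
labels = [0, 1, 0, 1]  # 1 = toxic, 0 = non-toxic

model = make_pipeline(TfidfVectorizer(), ComplementNB())
model.fit(comments, labels)
print(model.predict(["what a thoughtful comment"]))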

Experiment Results

[Figure: experiment results]

Model Evaluation and Metrics Summary

[Figures: model evaluation and metrics summary]

Future Work

  1. Multi-Class Text Classification.

    The current dataset is highly imbalanced; a potential solution is to build a two-step model (see the sketch after this list).

  2. Parallel Computing Technique

    Utilize parallel computing techniques to run larger NLP models, including RoBERTa-LONG, T5, and XLNet. These three models ran into resource-exhausted errors under the limited GPU capacity of Google Colab.
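One possible reading of the two-step idea (our interpretation, not an implemented result): a first binary model separates toxic from clean comments, and only flagged comments reach a second model that assigns a toxicity category. A hedged sketch with hypothetical toy data:

# Hypothetical sketch of a two-step classifier for imbalanced data:
# step 1 separates toxic from clean comments; step 2 categorizes only
# the comments flagged as toxic. All data below is toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = ["have a nice day", "you idiot", "I will hurt you", "good point"]
is_toxic = [0, 1, 1, 0]

step1 = make_pipeline(TfidfVectorizer(),
                      LogisticRegression(class_weight="balanced"))
step1.fit(comments, is_toxic)

toxic_comments = ["you idiot", "I will hurt you"]
toxic_type = ["insult", "threat"]  # categories for the toxic comments only
step2 = make_pipeline(TfidfVectorizer(), LogisticRegression())
step2.fit(toxic_comments, toxic_type)

def classify(comment):
    # Route through step 1; only toxic comments reach step 2.
    if step1.predict([comment])[0] == 0:
        return "clean"
    return step2.predict([comment])[0]

print(classify("you complete idiot"))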


Helpful Information

Environment

The notebooks and scripts were run on Google Colaboratory Pro in a GPU environment.

Transformer Version

Hugging Face has since updated the Transformers library; to run our code (e.g., the BERT notebooks), revert to the earlier version we used:

!pip install -q transformers==4.37.2
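With that version pinned, the sketch below shows a minimal way to load a pretrained BERT for binary sequence classification under transformers==4.37.2. The checkpoint name is the public bert-base-uncased, not our fine-tuned weights, and the example input is hypothetical.

# Minimal sketch of loading a pretrained BERT for binary sequence
# classification under transformers==4.37.2. The checkpoint is the
# public bert-base-uncased, not our fine-tuned weights.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

inputs = tokenizer("have a wonderful day", return_tensors="pt",
                   truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (meaningful after fine-tuning)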

