Course Project for CS769: Advanced Natural Language Processing
- Abhay Kumar (github username: abhayk1201)
- Neal B Desai (github username: nbdesai1992)
- Priyavarshini Murugan (github username: PriyavarshiniM)
Report:
If running on Google Colab, just install the Python packages using

```bash
pip install -r requirements.txt
```
Pre-requisites:
- Python 3.7
- PyTorch

Running `setup.sh` will install the `goemotion` environment with the required Python packages.
These are some Colab notebooks for easy-to-run experiments. For the detailed experimental setup, follow the instructions in the sections below.
- GoEmotions single-task learning for the original, Ekman, and sentiment taxonomies; uncased vs. cased BERT experiments
- Multi-task setting experiments Colab notebook
- Additional multi-task setting experiments
- Data preparation (Twitter Sentiment140) Colab notebook
- SST-2 single-task learning setting
- Transfer learning experiments
GoEmotions is a corpus extracted from Reddit with human annotations for 28 emotion labels (27 emotion categories + Neutral).
- Processed data is already uploaded to the `data` folder in this repository for convenience, so you can skip the following steps.
- GoEmotions dataset link
- Dataset splits: training dataset (43,410), test dataset (5,427), and validation dataset (5,426).
- Maximum sequence length in training and evaluation datasets: 30
- The emotion categories are: admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise.
- The raw dataset can be downloaded using:

```bash
wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv
wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_2.csv
wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_3.csv
```
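If you work from the raw files, a minimal loading sketch (assuming `pandas` is installed; the raw CSVs contain one row per rater annotation):

```python
# Minimal sketch: load and concatenate the three raw GoEmotions CSVs.
import pandas as pd

frames = [pd.read_csv(f"data/full_dataset/goemotions_{i}.csv") for i in (1, 2, 3)]
full = pd.concat(frames, ignore_index=True)
print(full.shape)  # one row per (comment, rater) annotation
```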
- Original GoEmotions (27 emotions + neutral)
- Sentiment Grouping (positive, negative, ambiguous + neutral)
- Ekman (anger, disgust, fear, joy, sadness, surprise + neutral), with the mapping:
  - anger: anger, annoyance, disapproval
  - disgust: disgust
  - fear: fear, nervousness
  - joy: all positive emotions
  - sadness: sadness, disappointment, embarrassment, grief, remorse
  - surprise: all ambiguous emotions
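The Ekman grouping can be written down as a simple mapping. A sketch, where the "joy" and "surprise" expansions are assumed to follow the positive and ambiguous sentiment groups above:

```python
# Sketch of the Ekman grouping as a dict from Ekman category to
# GoEmotions labels (the expansions of "joy"/"surprise" are assumptions
# based on the sentiment grouping).
EKMAN_MAPPING = {
    "anger": ["anger", "annoyance", "disapproval"],
    "disgust": ["disgust"],
    "fear": ["fear", "nervousness"],
    "joy": ["admiration", "amusement", "approval", "caring", "desire",
            "excitement", "gratitude", "joy", "love", "optimism",
            "pride", "relief"],
    "sadness": ["sadness", "disappointment", "embarrassment", "grief", "remorse"],
    "surprise": ["surprise", "confusion", "curiosity", "realization"],
    "neutral": ["neutral"],
}

# Invert it to map a GoEmotions label to its Ekman group.
TO_EKMAN = {emo: group for group, emos in EKMAN_MAPPING.items() for emo in emos}
```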
The `Config` directory has the respective config files for the above groupings/taxonomies. Edit the corresponding `Config/{}.json` file for the required taxonomy and pass the grouping/taxonomy as an argument, as shown below.
You can set `do_train` and `do_eval` depending on whether you want training, evaluation, or both. You can also change hyperparameters such as `train_batch_size`, `learning_rate`, and `num_train_epochs`.
```bash
$ python3 goemotions_classifier.py --taxonomy {$TAXONOMY}
$ python3 goemotions_classifier.py --taxonomy original
$ python3 goemotions_classifier.py --taxonomy sentiment
$ python3 goemotions_classifier.py --taxonomy ekman
```
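If you prefer to edit the knobs programmatically rather than by hand, a hypothetical sketch is below; the key names follow the options mentioned above, but the actual schema of `Config/{}.json` may differ:

```python
# Hypothetical sketch: tweak a taxonomy config before launching training.
# Key names follow the text above; the real Config/{}.json schema may differ.
import json

with open("Config/original.json") as f:
    cfg = json.load(f)

cfg.update({
    "do_train": True,          # run training
    "do_eval": True,           # also evaluate on the dev set
    "train_batch_size": 16,
    "learning_rate": 5e-5,
    "num_train_epochs": 10,
})

with open("Config/original.json", "w") as f:
    json.dump(cfg, f, indent=2)
```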
To run other tasks in the single-task setting, just add a corresponding `Config/{}.json` file and pass its name as the taxonomy:

```bash
$ python3 goemotions_classifier.py --taxonomy {$TAXONOMY}
```
- Original Dataset download link
- We have shared a Google Drive link to the processed dataset.
- Train set: Total of 1,600,000 training tweets (800,000 tweets with positive sentiment, and 800,000 tweets with negative sentiment).
- Test set: Composed of 177 negative sentiment tweets and 182 positive sentiment tweets.
- Data preparation Colab notebook
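A minimal sketch of the preparation step (see the notebook above), assuming the standard Sentiment140 layout of six unnamed CSV columns with target 0 = negative and 4 = positive; file names here are illustrative:

```python
# Sketch: turn the raw Sentiment140 CSV into a text/label TSV.
# Assumes the standard six-column layout; output path is illustrative.
import pandas as pd

cols = ["target", "id", "date", "flag", "user", "text"]
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", names=cols)
df["label"] = (df["target"] == 4).astype(int)  # 1 = positive, 0 = negative
df[["text", "label"]].to_csv("data/sentiment140_train.tsv", sep="\t", index=False)
```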
Setup: Download the pretrained BERT PyTorch model from the Google Drive link. You can skip the next step if you download this converted PyTorch model; we have already shared the Google Drive link after doing the conversion.
Otherwise, you can convert a TensorFlow BERT checkpoint (the pre-trained models released by Google) into a PyTorch save file using the `convert_tf_checkpoint_to_pytorch.py` script.
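Roughly, the conversion does the following; this sketch uses the current HuggingFace `transformers` utilities (the repo's script may rely on the older `pytorch-pretrained-bert` equivalents), and the paths are illustrative:

```python
# Sketch of the TF -> PyTorch BERT checkpoint conversion.
import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

config = BertConfig.from_json_file("uncased_L-12_H-768_A-12/bert_config.json")
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, "uncased_L-12_H-768_A-12/bert_model.ckpt")
torch.save(model.state_dict(), "uncased_L-12_H-768_A-12/pytorch_model.bin")
```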
If you are running in Google Colab, modify the path variables to match your setup and run the following command.
```bash
python run_multi_task.py \
  --seed 42 \
  --output_dir ./Model/MTL \
  --tasks all \
  --sample 'anneal' \
  --multi \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir ./data/ \
  --vocab_file ./uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file ./config/pals_config.json \
  --init_checkpoint ./uncased_L-12_H-768_A-12/pytorch_model.bin \
  --max_seq_length 50 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 10 \
  --gradient_accumulation_steps 1
```
Note the different arguments:
- tasks: `all` runs all tasks together; `single` runs a single task (you also need to pass `--task_id` for the required task).
- sample: The sampling scheme used to mix the tasks during training (see the sketch after this list):
  - `rr` (round robin): select a batch of training examples from each task, cycling through the tasks in a fixed order. This may not work well if different tasks have very different numbers of training examples.
  - `prop`: at each training step, select a batch of examples from task $i$ with probability $p_i \propto N_i$, where $N_i$ is the training-set size of task $i$.
  - `sqrt`: like `prop`, but sampling is proportional to the square root of the training-set size, $p_i \propto \sqrt{N_i}$.
  - `anneal`: annealed sampling changes the proportions with each epoch so that sampling becomes more nearly uniform toward the end of training (where we are most concerned about interference).
- max_seq_length: The maximum total input sequence length after WordPiece tokenization. Longer sequences are truncated and shorter ones are padded.
- train_batch_size: Batch size for training.
- num_train_epochs: Number of epochs of training.
- do_train: Whether to run training.
- do_eval: Whether to run evaluation on the dev set.
- data_dir: Directory containing the datasets of the different tasks.
- vocab_file: Path to the vocab file of the pretrained model downloaded from the Google Drive link in the setup above.
- init_checkpoint: Path to the pretrained model checkpoint downloaded from the Google Drive link in the setup above.
- bert_config_file: Config file for the different multi-task learning settings.
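As a rough sketch of how the non-round-robin sampling schemes translate into per-task batch probabilities (the annealing exponent schedule below is an assumption; the exact schedule in `run_multi_task.py` may differ):

```python
import numpy as np

def task_sampling_probs(sizes, scheme, epoch=0, num_epochs=10):
    """Probability of drawing the next training batch from each task.

    sizes: per-task training-set sizes N_i.
    scheme: 'prop', 'sqrt', or 'anneal' ('rr' needs no probabilities).
    """
    sizes = np.asarray(sizes, dtype=float)
    if scheme == "prop":      # p_i proportional to N_i
        weights = sizes
    elif scheme == "sqrt":    # p_i proportional to sqrt(N_i)
        weights = np.sqrt(sizes)
    elif scheme == "anneal":  # p_i proportional to N_i ** alpha, where alpha
        # decays toward 0 so sampling approaches uniform (assumed schedule)
        alpha = 1.0 - 0.8 * epoch / max(num_epochs - 1, 1)
        weights = sizes ** alpha
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return weights / weights.sum()

# Example: two tasks of sizes 43,410 and 1,600,000 at the first and last epoch.
print(task_sampling_probs([43410, 1600000], "anneal", epoch=0))  # close to prop
print(task_sampling_probs([43410, 1600000], "anneal", epoch=9))  # more uniform
```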
Modify the `./config/pals_config.json` config file to adjust hyperparameters and settings. To run the different MTL experiments, you can also use `./run.sh`. Additionally, refer to the easy-to-use Colab notebook for the previously run experiments.
BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning