CS769_Project

BERT-MTL: Multi-Task Learning Paradigm for Improved Emotion Classification using BERT Model

Course Project for CS769: Advanced Natural Language Processing

Team Members: Team #8

  • Abhay Kumar (github username: abhayk1201)
  • Neal B Desai (github username: nbdesai1992)
  • Priyavarshini Murugan (github username: PriyavarshiniM)

Report:

Install

If running on Google Colab, just install the Python packages with pip install -r requirements.txt.

Pre-requisites: Python 3.7 and PyTorch. Running setup.sh will set up the goemotion environment with the required Python packages.

Colab notebooks

These Colab notebooks provide easy-to-run versions of the experiments. For the detailed experimental setup, follow the instructions in the sections below.

GoEmotions Data

GoEmotions is a corpus of Reddit comments human-annotated with 28 emotion labels (27 emotion categories + Neutral).

  • Processed data is already included in the data folder of this repository for convenience, so you can skip the following steps.
  • Goemotion Dataset Link
  • Dataset splits: training dataset (43,410), test dataset (5,427), and validation dataset (5,426).
  • Maximum sequence length in training and evaluation datasets: 30
  • The emotion categories are: admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise.
  • The raw dataset can be downloaded with:
wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv
wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_2.csv
wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_3.csv
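
As a quick sanity check, the raw CSVs can be inspected with pandas. This is a minimal sketch, assuming the published GoEmotions column layout (a text column plus one binary column per emotion label); the processed files in the data folder may be organized differently.

import pandas as pd

# Load the three raw GoEmotions CSVs downloaded above.
files = [f"data/full_dataset/goemotions_{i}.csv" for i in (1, 2, 3)]
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

print(df.shape)             # number of annotated comments x columns
print(df.columns.tolist())  # expect a 'text' column plus one 0/1 column per emotion

# If the per-emotion binary columns are present, count how often each label fires.
emotions = ["admiration", "amusement", "anger", "annoyance", "approval", "caring",
            "confusion", "curiosity", "desire", "disappointment", "disapproval",
            "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
            "joy", "love", "nervousness", "optimism", "pride", "realization",
            "relief", "remorse", "sadness", "surprise", "neutral"]
present = [e for e in emotions if e in df.columns]
print(df[present].sum().sort_values(ascending=False))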

GoEmotions Data Hierarchical Grouping

  • Original GoEmotions (27 emotions + neutral)
  • Sentiment Grouping (positive, negative, ambiguous + neutral)
  • Ekman (anger, disgust, fear, joy, sadness, surprise + neutral), with the mapping:
    • anger: anger, annoyance, disapproval
    • disgust: disgust
    • fear: fear, nervousness
    • joy: all positive emotions
    • sadness: sadness, disappointment, embarrassment, grief, remorse
    • surprise: all ambiguous emotions

The Config directory has the respective config files for the above groupings/taxonomies.
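
For reference, the Ekman grouping above can be written down as an explicit label mapping. The sketch below is illustrative (the dict and variable names are not from the repo's config files); the membership of the "all positive" and "all ambiguous" groups follows the GoEmotions sentiment grouping.

# Illustrative Ekman-style grouping of the 27 GoEmotions labels (+ neutral).
# "joy" collects the positive emotions and "surprise" the ambiguous ones.
EKMAN_MAPPING = {
    "anger":    ["anger", "annoyance", "disapproval"],
    "disgust":  ["disgust"],
    "fear":     ["fear", "nervousness"],
    "joy":      ["admiration", "amusement", "approval", "caring", "desire",
                 "excitement", "gratitude", "joy", "love", "optimism",
                 "pride", "relief"],
    "sadness":  ["sadness", "disappointment", "embarrassment", "grief", "remorse"],
    "surprise": ["surprise", "confusion", "curiosity", "realization"],
    "neutral":  ["neutral"],
}

# Invert the grouping to map each fine-grained label to its Ekman category.
LABEL_TO_EKMAN = {label: group for group, labels in EKMAN_MAPPING.items()
                  for label in labels}
print(LABEL_TO_EKMAN["annoyance"])  # -> "anger"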

Single-task (GoEmotions) running instructions

Edit the corresponding Config/{}.json file for the required taxonomy and pass the grouping/taxonomy as an argument, as shown below. Set do_train and do_eval depending on whether you want training, evaluation, or both. You can also change hyperparameters such as train_batch_size, learning_rate, and num_train_epochs.

$ python3 goemotions_classifier.py --taxonomy {$TAXONOMY}

$ python3 goemotions_classifier.py --taxonomy original
$ python3 goemotions_classifier.py --taxonomy sentiment
$ python3 goemotions_classifier.py --taxonomy ekman

To run the following tasks (e.g. Sentiment140, Suicide and Depression Detection) in the single-task setting, just add a corresponding Config/{}.json file and pass it as the taxonomy:

$ python3 goemotions_classifier.py --taxonomy {$TAXONOMY}

Sentiment140 Data

Suicide and Depression Detection Data

Multi-task Learning (MTL) running instructions

Setup: Download the pre-trained BERT PyTorch model from the Google Drive link. You can skip the next step if you download this pre-trained PyTorch model; we have already shared the Google Drive link after doing the conversion.

Otherwise, you can convert a TensorFlow checkpoint for BERT (the pre-trained models released by Google) into a PyTorch save file using the convert_tf_checkpoint_to_pytorch.py script.
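
For reference, the upstream pytorch-pretrained-bert version of that script is invoked roughly as below; the flag names follow the upstream script, and the paths assume the uncased_L-12_H-768_A-12 checkpoint used elsewhere in this README, so check both against the copy in this repo.

python convert_tf_checkpoint_to_pytorch.py \
  --tf_checkpoint_path ./uncased_L-12_H-768_A-12/bert_model.ckpt \
  --bert_config_file ./uncased_L-12_H-768_A-12/bert_config.json \
  --pytorch_dump_path ./uncased_L-12_H-768_A-12/pytorch_model.bin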

If you are running in Google Colab, modify the path variables to match your setup and run the following command.

python run_multi_task.py \
  --seed 42 \
  --output_dir ./Model/MTL \
  --tasks all \
  --sample 'anneal' \
  --multi \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir ./data/ \
  --vocab_file ./uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file ./config/pals_config.json \
  --init_checkpoint ./uncased_L-12_H-768_A-12/pytorch_model.bin \
  --max_seq_length 50 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 10 \
  --gradient_accumulation_steps 1

Note the different arguments:

  • tasks: all runs all tasks together; single runs a single task (you also need to pass --task_id for the required task).
  • sample: Sampling scheme used to pick which task to draw a training batch from (see the sketch after this list).
    • rr: Round robin: select a batch of training examples from each task, cycling through the tasks in a fixed order. This may not work well if tasks have very different numbers of training examples.
    • prop: Select a batch of examples from task i with probability p_i at each training step, where p_i is proportional to N_i, the training dataset size for task i.
    • sqrt: Like prop, but the sampling probability is proportional to the square root of the training dataset size.
    • anneal: Annealed sampling changes the proportions each epoch so that sampling becomes more uniform towards the end of training (where we are most concerned about interference).
  • max_seq_length: The maximum total input sequence length after WordPiece tokenization. Longer sequences are truncated and shorter ones are padded.
  • train_batch_size: Batch size for training.
  • num_train_epochs: Number of epochs of training.
  • do_train: Whether to run training.
  • do_eval: Whether to run evaluation on the dev set.
  • data_dir: Directory containing the datasets for the different tasks.
  • vocab_file: Path to the vocab file of the pre-trained model downloaded from the Google Drive link in the setup above.
  • init_checkpoint: Path to the pre-trained model checkpoint downloaded from the Google Drive link in the setup above.
  • bert_config_file: Config file with the different multi-task learning settings.
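
To make the sampling schemes concrete, here is a rough sketch of how the per-task sampling probabilities could be computed. The exact schedule in run_multi_task.py may differ; the N_i^alpha form with alpha decaying across epochs follows the BERT and PALs paper referenced below.

def task_probs(dataset_sizes, scheme, epoch=0, num_epochs=10):
    """Per-task sampling probabilities for the 'prop', 'sqrt', and 'anneal' schemes."""
    if scheme == "prop":      # p_i proportional to N_i
        weights = [float(n) for n in dataset_sizes]
    elif scheme == "sqrt":    # p_i proportional to sqrt(N_i)
        weights = [n ** 0.5 for n in dataset_sizes]
    elif scheme == "anneal":  # p_i proportional to N_i^alpha, alpha shrinking each epoch
        alpha = 1.0 - 0.8 * epoch / max(num_epochs - 1, 1)
        weights = [n ** alpha for n in dataset_sizes]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(weights)
    return [w / total for w in weights]

# Example: three tasks with very different training-set sizes.
sizes = [43410, 8000, 1600]
for epoch in (0, 5, 9):
    print(epoch, task_probs(sizes, "anneal", epoch=epoch, num_epochs=10))
# 'rr' (round robin) simply cycles through the tasks in a fixed order instead.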

Modify the ./config/pals_config.json config file to adjust hyperparameters and settings. To run the different MTL experiments, you can also use ./run.sh. Additionally, refer to the easy-to-run Colab notebook for previously run experiments.

References

Awesome Multi-Task Learning

BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning

GoEmotions Google Data and Baseline Model
