This repository is a PyTorch implementation of the chunk-level speech emotion recognition (SER) framework described in the paper, applied to the MSP-Podcast corpus.
- Python 3.6+
- Ubuntu 18.04+
- torch version 1.4.0+
- CUDA 10.0+
- Conventional packages such as scipy, numpy, and pandas
- The MSP-Podcast corpus (request to download from UTD-MSP lab website)
- The IS13ComParE LLDs (acoustic features) extracted with openSMILE (see the opensmile-LLDs-extraction repository)
After extracting the IS13ComParE LLDs (e.g., XXX_llds/feat_mat/*.mat) for the MSP-Podcast corpus [whatever version], we use the 'labels_concensus.csv' file provided with the corpus as the default label input.
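For reference, below is a minimal sketch of loading one utterance's LLD matrix and the label file. The example filename, the .mat key 'Audio_data', and the CSV column layout are assumptions and may differ from your extraction setup.

```python
# Minimal loading sketch. The filename, the .mat key 'Audio_data', and the CSV column
# names are assumptions; adjust them to match your extracted features and label file.
import pandas as pd
from scipy.io import loadmat

labels = pd.read_csv('labels_concensus.csv')                   # per-utterance attribute labels
feat = loadmat('XXX_llds/feat_mat/MSP-PODCAST_0001_0001.mat')  # hypothetical example file
llds = feat['Audio_data']                                      # assumed key; shape (num_frames, feat_dim)
print(llds.shape, list(labels.columns))
```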
- Change the data & label root paths in norm_para.py, then run it to obtain the z-normalization parameters (mean and std) computed on the Train set (a minimal sketch of this step appears after this list). We also provide the parameters for the v1.6 corpus in the 'NormTerm' folder.
- Change the data & label root paths in training.py for the LSTM model. The running args are:
- -iter: maximum training iterations
- -batch: batch size for training
- -emo: emotion attributes (Act, Dom or Val)
- -atten: type of chunk-level attention model (NonAtten, GatedVec, RnnAttenVec or SelfAttenVec)
- Run in the terminal:
python training.py -iter 5000 -batch 128 -emo Dom -atten SelfAttenVec
- Change the data & label & model root paths in testing.py to obtain testing results on the MSP-Podcast test set.
- Run in the terminal:
python testing.py -iter 5000 -batch 128 -emo Dom -atten SelfAttenVec
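For clarity, here is a sketch of what the normalization step (norm_para.py) computes: per-dimension mean and std over all Train-set frames. The paths, the .mat key 'Audio_data', the CSV columns 'Split_Set'/'FileName', and the output filenames are assumptions, not the repository's exact conventions.

```python
# Sketch of computing z-normalization parameters (mean/std) over all Train-set frames.
# Paths, the .mat key, the CSV column names, and the output filenames are assumptions.
import numpy as np
import pandas as pd
from scipy.io import loadmat

label_df = pd.read_csv('labels_concensus.csv')
train_files = label_df.loc[label_df['Split_Set'] == 'Train', 'FileName']

frames = []
for fname in train_files:
    mat = loadmat('XXX_llds/feat_mat/' + fname.replace('.wav', '.mat'))
    frames.append(mat['Audio_data'])
frames = np.concatenate(frames, axis=0)                 # (total_train_frames, feat_dim)

np.save('NormTerm/feat_mean.npy', frames.mean(axis=0))  # illustrative output names
np.save('NormTerm/feat_std.npy', frames.std(axis=0))
```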
We provide some trained models based on version 1.6 of the MSP-Podcast corpus in the 'trained_model_v1.6' folder. These PyTorch models have been verified to show CCC performance trends similar to the original Keras implementation reported in the paper.
Model | Act | Val | Dom |
---|---|---|---|
LSTM-RnnAttenVec (Keras) | 0.6955 | 0.3006 | 0.6175 |
LSTM-SelfAttenVec (Keras) | 0.6837 | 0.3337 | 0.6004 |
LSTM-RnnAttenVec (PyTorch) | 0.6906 | 0.2747 | 0.6132 |
LSTM-SelfAttenVec (PyTorch) | 0.7099 | 0.3206 | 0.6299 |
Users can reproduce these results by running testing.py with the corresponding args.
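The numbers above are concordance correlation coefficients (CCC). As a reference, a small standalone CCC implementation (not taken from the repository) looks like this:

```python
import numpy as np

def concordance_cc(pred, true):
    """CCC = 2*cov(x,y) / (var(x) + var(y) + (mean(x) - mean(y))^2)."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    cov = np.mean((pred - pred.mean()) * (true - true.mean()))
    return 2 * cov / (pred.var() + true.var() + (pred.mean() - true.mean()) ** 2)
```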
The implementation targets the MSP-Podcast corpus; however, the framework can be applied to general speech-based sequence-to-one tasks (e.g., speaker recognition, gender detection, acoustic event classification, or SER). To apply the framework to your own task, the following parameters need to be specified in the DynamicChunkSplitData functions in the utils.py file (see the sketch after this list):
- max duration in second of your corpus (i.e., Tmax)
- desired chunk window length in second (i.e., Wc)
- number of chunks a sentence is split into (i.e., C = ceiling of Tmax/Wc)
- number of frames within a chunk (i.e., m)
- scaling factor to increase the number of split chunks (i.e., n = 1, 2 or 3 is suggested)
- remember to change the NN model dimensions accordingly: feat_num, time_step and C
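To make these parameters concrete, below is a minimal sketch of the dynamic chunking idea, not the repository's exact DynamicChunkSplitData implementation: every utterance is split into the same number C*n of fixed-length chunks of m frames, with the hop size adapted to the utterance duration. For example, with Tmax = 11 s and Wc = 1 s, C = ceil(11/1) = 11.

```python
import numpy as np

def dynamic_chunk_split(llds, m, C, n=1):
    """Split an utterance LLD matrix (num_frames, feat_dim) into C*n chunks of m frames.

    m: frames per chunk (Wc expressed in frames), C: ceil(Tmax/Wc), n: scaling factor.
    Illustrative sketch only; utterances shorter than one chunk are zero-padded.
    """
    num_frames, feat_dim = llds.shape
    if num_frames < m:
        pad = np.zeros((m - num_frames, feat_dim), dtype=llds.dtype)
        llds = np.concatenate([llds, pad], axis=0)
        num_frames = m
    # Evenly spaced start frames: shorter utterances get more chunk overlap, longer ones
    # less, so every utterance yields exactly C*n chunks.
    starts = np.linspace(0, num_frames - m, C * n).astype(int)
    return np.stack([llds[s:s + m] for s in starts])  # shape: (C*n, m, feat_dim)
```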
If you use this code, please cite the following paper:
Wei-Cheng Lin and Carlos Busso, "Chunk-Level Speech Emotion Recognition: A General Framework of Sequence-to-One Dynamic Temporal Modeling"
@article{Lin_2023_4,
author={W.-C. Lin and C. Busso},
title={Chunk-Level Speech Emotion Recognition: A General Framework of Sequence-to-One Dynamic Temporal Modeling},
journal={IEEE Transactions on Affective Computing},
number={2},
volume={14},
pages={1215-1227},
year={2023},
month={April-June},
doi={10.1109/TAFFC.2021.3083821},
}