Implementation of the project "Self-supervised pretraining for phoneme recognition, and generalization on foreign languages"
Authors: Apavou Clément & Belkada Younes & Leo Tronchon & Arthur Zucker
This repository is powered by HuggingFace 🤗, PyTorch Lightning and Weights & Biases.
The scarcity of annotated data, and the heavy cost of producing it, limit our ability to train deep neural networks for audio processing tasks. The speech community has therefore developed feature learning methods that require minimal annotated data, which mostly fall under unsupervised and self-supervised techniques.
Recently, self-supervised learning methods for the text modality have outperformed state-of-the-art methods on downstream tasks by fine-tuning pretrained models on a relatively small amount of data. These approaches have since been tested on other modalities such as images and audio.
Phoneme recognition is an exciting challenge that involves processing a raw audio recording and predicting the sequence of phonemes pronounced by the speaker. Throughout this project, we compare three self-supervised models, Wav2vec (2019, 2020), HuBERT (2021) and WavLM (2022), all pretrained on a corpus of English speech, and use them in various ways to perform phoneme recognition in different languages with a network trained using the Connectionist Temporal Classification (CTC) algorithm. Several questions will be addressed:
- What is the impact of pretraining on English, especially for languages that are very different from English? Which method(s) work best for transferring knowledge from English to other languages?
- Which method extracts the best features for phoneme recognition?
- What is the influence of the amount of training data on model performance?
In this project, we address these questions by drawing conclusions from our experiments.
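In all settings, the recognizer itself is simple: a linear layer maps the encoder's frame-level features to phoneme logits and is trained with CTC. The sketch below illustrates this setup with PyTorch's built-in `nn.CTCLoss`; the dimensions and vocabulary size are illustrative, not the repository's exact code.

```python
# Minimal sketch: a linear CTC head on top of frame-level encoder features.
import torch
import torch.nn as nn

hidden_size = 768       # h for the Base models (1024 for the Large ones)
num_phonemes = 50       # illustrative phoneme vocabulary size; +1 for the CTC blank

head = nn.Linear(hidden_size, num_phonemes + 1)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

features = torch.randn(4, 200, hidden_size)     # (batch, frames, h) from the encoder
log_probs = head(features).log_softmax(dim=-1)  # (batch, frames, classes)
log_probs = log_probs.transpose(0, 1)           # CTCLoss expects (frames, batch, classes)

targets = torch.randint(1, num_phonemes + 1, (4, 30))  # phoneme indices (0 is blank)
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```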
- Modularity between SOTA self-supervised speech models
- Freedom to select any language available in CommonVoice, hosted on HuggingFace.
- Nice visualization tool through wandb.
Diagram of the models used for the experiments. N=22 and h=1024 for HuBERT Large and WavLM Large, and N=11 and h=768 for Wav2vec2 Base and WavLM Base. Made by us.
Dutch (du), Spanish (es), French (fr), Italian (it), Kyrgyz (ky), Russian (ru), Swedish (sv), Turkish (tr), Tatar (tt) and Mandarin (zh). From https://github.com/facebookresearch/CPC_audio.
Please refer to our example notebook if you want to train or test a model. To understand the command-line arguments that you can use, run `python main.py --help`, which prints:
Hparams ['parameters.hparams']:
Hyperparameters for the run
--wandb_entity str wandb (default: asr-project)
--debug bool (default: False)
--test bool test code before running, if testing, no checkpoints are written (default: True)
--wandb_project str (default: test-asr)
--root_dir str root_dir (default: /home/arthur/Work/MVA-S2/Speech/Multilingual-PR)
--seed_everything [int]
basic params (default: None)
--gpu int number of gpus (default: 1)
--hparams.max_epochs int
maximum number of epochs (default: 100)
--weights_path str (default: /home/arthur/Work/MVA-S2/Speech/Multilingual-PR/weights)
--tune_lr bool modes (default: False)
--dev_run bool (default: False)
--train bool (default: True)
--best_model str (default: )
--log_freq_audio int (default: 10)
--log_nb_audio int (default: 2)
--val_check_interval float
trainer params (default: 1.0)
--limit_train_batches float
1.0 (default: 1.0)
--limit_val_batches float
1.0 (default: 1.0)
--enable_progress_bar bool
(default: True)
--best_model_run str testing params (default: WavLM_sv)
--early_stopping bool
Early Stopping (default: True)
--early_stopping_params typing.Dict[str, typing.Any]
(default: {'monitor': 'val/per', 'patience': 10, 'mode': 'min', 'verbose': True})
DatasetParams ['parameters.data_param']:
Dataset Parameters
! The batch_size and number of crops should be defined here
--dataset_name str Hugging Face datasets parameters (default: common_voice)
--use_auth_token bool
True if use mozilla-foundation datasets (default: False)
--subset str (default: sv-SE)
--download_mode str chosen language (see https://huggingface.co/datasets/common_voice) (default: reuse_dataset_if_exists)
--cache_dir str (default: /home/arthur/Work/MVA-S2/Speech/Multilingual-PR/assets)
--language str to create vocabulary of phonemes (default: sv)
--root_path_annotation str
(default: /home/arthur/Work/MVA-S2/Speech/Multilingual-PR/assets/common_voices_splits)
--phoible_csv_path str
(default: /home/arthur/Work/MVA-S2/Speech/Multilingual-PR/assets)
--num_workers int Dataloader parameters (default: 20)
--batch_size int (default: 2)
--max_input_length_in_sec float
Dataset processing parameters (default: 5)
--num_proc int (default: 4)
--create_dataset bool
(default: False)
NetworkParams ['parameters.network_param']:
NetworkParams(network_name: str = 'WavLM', pretrained_name: Union[str, NoneType] = '', freeze: bool = True, freeze_transformer: bool = True, eos_token: str = '</s>', bos_token: str = '<s>', unk_token: str = '<unk>', pad_token: str = '<pad>', word_delimiter_token: str = '|')
--network_name str Hubert, Wav2Vec2, WavLM (default: WavLM)
--pretrained_name [str]
(default: )
--freeze bool (default: True)
--freeze_transformer bool
(default: True)
--eos_token str Phoneme Tokenizer (default: </s>)
--bos_token str (default: <s>)
--unk_token str (default: <unk>)
--pad_token str (default: <pad>)
--word_delimiter_token str
(default: |)
OptimizerParams ['parameters.optim_param']:
Optimization parameters
--optimizer str (default: AdamW)
--lr float (default: 0.02)
--weight_decay float (default: 1e-08)
--accumulate_grad_batches int
1 for no accumulation (default: 16)
--scheduler [str] Scheduler parameters (default: None)
--optim_param.max_epochs int
Cosine, ReduceLROnPlateau, MultiStepLR, StepLR or None Cosine scheduler (default: 10)
--warmup_epochs int (default: 1)
--warmup_start_lr float
(default: 0.0006)
--eta_min float (default: 5e-06)
--step_size int Step LR scheduler (default: 2)
--gamma float also for multi step lr (default: 0.1)
--milestones str MultiStepLR scheduler (default: [8, 10, 15])
--min_lr float ReduceLROnPlateau scheduler (default: 5e-09)
--patience int (default: 10)
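As an example, a run on Swedish with a frozen WavLM backbone could be launched with `python main.py --network_name WavLM --subset sv-SE --language sv --freeze True`; this is an illustrative combination of the flags above, whose defaults are shown in parentheses in the help output.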
The project is based on the Mozilla CommonVoice dataset available on HuggingFace. When the script is launched, the program automatically downloads the correct dataset and transforms the ground-truth sentences into phonemes using phonemizer. You are free to choose any dataset available on HuggingFace that is covered by the phoneme dictionaries cited above. For our experiments we used:
it, nl, tr, ru, sv-SE
Feel free to try any other languages and submit a Pull Request.
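The sketch below illustrates this pipeline, assuming the `datasets` and `phonemizer` packages (the `mozilla-foundation` variants of Common Voice may additionally require an authentication token, cf. `--use_auth_token` above):

```python
# Minimal sketch of the data pipeline: download a CommonVoice subset and
# phonemize its ground-truth sentences with the espeak-ng backend.
from datasets import load_dataset
from phonemizer import phonemize

# Swedish subset; any of it, nl, tr, ru, sv-SE works the same way.
cv = load_dataset("common_voice", "sv-SE", split="train")

sentence = cv[0]["sentence"]
phonemes = phonemize(sentence, language="sv", backend="espeak")
print(sentence, "->", phonemes)
```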
Schema of Wav2vec2, HuBERT and WavLM.
For our experiments, we used models hosted on the Hugging Face Hub that are pretrained on 960 hours of 16 kHz English speech from the LibriSpeech dataset. The following pretrained models were used:
- Wav2vec2 Base: facebook/wav2vec2-base-960h
- WavLM Base: microsoft/wavlm-base
- WavLM Large: microsoft/wavlm-large
- HuBERT Large: facebook/hubert-large-ls960-ft
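In the frozen-features setting, these checkpoints are used as fixed feature extractors. Below is a minimal sketch of how such an encoder can be loaded and frozen with the `transformers` library; it is illustrative, not the repository's exact code.

```python
# Load a pretrained encoder from the Hugging Face Hub and freeze it, so that
# only the phoneme classifier on top is trained (the frozen-features setting).
import torch
from transformers import AutoModel

encoder = AutoModel.from_pretrained("microsoft/wavlm-base")  # or any checkpoint above
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # frozen features: the encoder is not fine-tuned

wave = torch.randn(1, 16000)  # one second of (here random) 16 kHz mono audio
with torch.no_grad():
    features = encoder(wave).last_hidden_state  # (1, frames, h); h=768 for Base models
```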
The language family tree can be found in the following figure. It gives insight into the genetic proximity of each language.
Language | Family | Proximity with English |
---|---|---|
Italian 🇮🇹 | Romance | 47.8 |
Russian 🇷🇺 | East Slavic | 60.3 |
Dutch 🇳🇱 | West Germanic | 27.2 |
Swedish 🇸🇪 | North Germanic | 26.7 |
Turkish 🇹🇷 | Turkic | 92.0 |
Genetic proximity between the studied languages and English, computed [here](http://www.elinguistics.net/Compare_Languages.aspx). [1, 30]: highly related languages, [30, 50]: related languages, [50, 70]: remotely related languages, [70, 78]: very remotely related languages, [78, 100]: no recognizable relationship.
English is part of the West Germanic family.
Source: https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md and http://www.elinguistics.net/Compare_Languages.aspx
Dataset: Common Voice Corpus 6.1: https://commonvoice.mozilla.org/fr/datasets
Transferring pretrained English models to other languages
Table of experiments where the models are **fine-tuned**. Here, we compare 3 different pretrained models, fine-tuned on the phoneme recognition task with different languages and a varying amount of training data.
Table of experiments using **frozen features**. Here, we compare 4 different pretrained models. The objective was to train a linear layer, on top of the pretrained models' frozen features, on the phoneme recognition task with different languages and a varying amount of training data.
Variation in the amount of training data with frozen features of models pretrained with the 3 different methods. Language: Swedish 🇸🇪.
PER on the test and validation sets vs Training data for the Swedish language with frozen features.
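All results are reported in phoneme error rate (PER): the edit distance between the predicted and reference phoneme sequences, normalized by the reference length. The repository implements it as a torchmetrics metric in `utils/per.py`; the function below is a minimal illustrative reimplementation.

```python
# Phoneme error rate: Levenshtein distance between prediction and reference,
# divided by the reference length.
def edit_distance(ref, hyp):
    # One-row dynamic programming over the hypothesis sequence.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution (or match)
    return d[-1]

def per(reference, prediction):
    return edit_distance(reference, prediction) / max(len(reference), 1)

print(per(["h", "ɛ", "l", "oʊ"], ["h", "ə", "l", "oʊ"]))  # 0.25
```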
├── agents
|   └── BaseTrainer.py
|
├── assets               # datasets and phoneme vocabularies are stored here
|
├── config
|   └── hparams.py       # configuration file
|
├── Datasets
|   |
|   └── datamodule.py    # PyTorch Lightning datamodules for the CommonVoice dataset
|
├── models
|   ├── BaseModule.py    # Lightning module
|   └── models.py        # Wav2vec2, WavLM and HuBERT using the Hugging Face library
|
├── utils                # utility functions
|   ├── agent_utils.py
|   ├── callbacks.py
|   ├── dataset_utils.py
|   ├── logger.py
|   ├── metrics.py
|   └── per.py           # torchmetrics implementation of the phoneme error rate
|
├── hparams.py           # configuration file
|
├── main.py              # main script to launch for training or inference
|
└── README.md