This repository implements the main part of my master's thesis in Data Science & Engineering. It is divided into two parts:
- Speaker Encoder
  - models: ECAPA-TDNN, WavLM series
  - data: VoxCeleb1, private dataset
- Text-to-Speech
  - model: FastSpeech2 (Microsoft implementation)
  - data: LibriTTS
These two parts are then integrated into ZeroShotFastSpeech2, a multi-speaker text-to-speech model that can clone unseen voices from about 5 seconds of reference audio.
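
To make the integration concrete, here is a minimal, dependency-free sketch of one common way a speaker embedding can condition a TTS encoder: frame-level embeddings from the reference audio are averaged into a single vector, which is then added to every phoneme hidden state. This is an illustration only, not necessarily how this repository does it. The function names (`pool_speaker_embedding`, `condition_on_speaker`) are hypothetical, and the sketch assumes the speaker embedding already has the same dimension as the hidden states (in practice a learned projection would usually handle that).

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def pool_speaker_embedding(frame_embeddings: List[Vector]) -> Vector:
    """Average frame-level embeddings into one utterance-level vector, then normalize.

    Hypothetical helper: stands in for the output of a speaker encoder
    applied to a short reference clip.
    """
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    mean = [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]
    return l2_normalize(mean)


def condition_on_speaker(phoneme_states: List[Vector], speaker: Vector) -> List[Vector]:
    """Add the same speaker vector to every phoneme hidden state.

    Assumption: speaker and hidden states share the same dimension.
    """
    return [[h + s for h, s in zip(state, speaker)] for state in phoneme_states]


# Toy example with 2-dimensional vectors.
frames = [[3.0, 0.0], [3.0, 8.0]]          # pretend frame-level speaker features
speaker = pool_speaker_embedding(frames)   # one vector for the whole clip
states = [[0.0, 0.0], [1.0, 1.0]]          # pretend phoneme encoder outputs
conditioned = condition_on_speaker(states, speaker)
```

Because the speaker vector is computed from the reference audio at inference time instead of being looked up in a fixed speaker table, the same conditioning step works for voices that were never seen during training.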