This repository implements Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) for the Persian language. The codebase is derived from this repository and has been updated to replace deprecated features and to complete the setup for Persian-language support.
1. Character-set definition:
Open the synthesizer/persian_utils/symbols.py file and update the _characters variable so that it includes every character that appears in your text files. Most Persian characters and symbols are already included in this variable:
_characters = "ءابتثجحخدذرزسشصضطظعغفقلمنهويِپچژکگیآۀأؤإئًَُّ!(),-.:;? ̠،…؛؟٪#ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_–@+/\u200c"
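To see which characters still need to be added, a small helper can scan the transcripts and report anything missing from the character set. This is only a sketch: the function name find_missing_characters is hypothetical, and it assumes the transcripts are UTF-8 encoded .txt files and that line breaks can be ignored.

```python
from pathlib import Path


def find_missing_characters(text_dir, characters):
    """Return the set of characters found in *.txt files under text_dir
    that do not occur in `characters`.

    Assumptions (not from the repository): transcripts are UTF-8 encoded,
    and line-break characters do not need to be part of the symbol set.
    """
    allowed = set(characters)
    missing = set()
    for txt_path in Path(text_dir).rglob("*.txt"):
        text = txt_path.read_text(encoding="utf-8")
        # Collect every character outside the allowed set, skipping line breaks.
        missing.update(ch for ch in text if ch not in allowed and ch not in "\r\n")
    return missing
```

When run from the repository root, the real value can be passed in, for example find_missing_characters("dataset/persian_data/train_data", _characters) after importing _characters from synthesizer/persian_utils/symbols.py. Any characters in the returned set should be appended to _characters before preprocessing.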
2. Data structures:
dataset/persian_data/
    train_data/
        speaker1/book-1/
            sample1.txt
            sample1.wav
            ...
        ...
    test_data/
        ...
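Because each utterance is stored as a .wav file with a matching .txt transcript, it can help to check the pairing before preprocessing. The sketch below assumes the two files share the same base name in the same folder, as in the layout above; the function name find_unpaired_samples is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_unpaired_samples(split_dir):
    """Return (wavs_without_txt, txts_without_wav) as sorted lists of paths.

    Assumes each sample consists of <name>.wav and <name>.txt in the same
    folder somewhere below split_dir.
    """
    root = Path(split_dir)
    # Compare files by their path without the extension.
    wav_stems = {p.with_suffix("") for p in root.rglob("*.wav")}
    txt_stems = {p.with_suffix("") for p in root.rglob("*.txt")}
    wavs_without_txt = sorted(str(s) + ".wav" for s in wav_stems - txt_stems)
    txts_without_wav = sorted(str(s) + ".txt" for s in txt_stems - wav_stems)
    return wavs_without_txt, txts_without_wav
```

For example, find_unpaired_samples("dataset/persian_data/train_data") returns two lists; both should be empty if every recording has a transcript and vice versa.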
3. Preprocessing:
python synthesizer_preprocess_audio.py dataset --datasets_name persian_data --subfolders train_data --no_alignments
python synthesizer_preprocess_embeds.py dataset/SV2TTS/synthesizer
4. Train synthesizer:
python synthesizer_train.py my_run dataset/SV2TTS/synthesizer
5. Inference:
To synthesize a wav file, place all final models in the saved_models/final_models directory.
If you have not trained the speaker encoder and vocoder models yourself, you can use the pretrained models in saved_models/default.
Inference using WavRNN as vocoder:
python inference.py --vocoder "WavRNN" --text "یک نمونه از خروجی" --ref_wav_path "/path/to/sample/reference.wav" --test_name "test1"
WavRNN is an older vocoder, however. To use HiFiGAN instead, you must first download a HiFiGAN model pretrained on English.
First, install the parallel_wavegan package. See this package for more information.
pip install parallel_wavegan
Then download the pretrained HiFiGAN model into your saved models directory:
from parallel_wavegan.utils import download_pretrained_model
download_pretrained_model("vctk_hifigan.v1", "saved_models/final_models/vocoder_HiFiGAN")
Now you can use HiFiGAN as the vocoder in the inference command:
python inference.py --vocoder "HiFiGAN" --text "یک نمونه از خروجی" --ref_wav_path "/path/to/sample/reference.wav" --test_name "test1"
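To synthesize several sentences with the same reference voice, the inference command can be driven from a short Python script. This is a minimal sketch that uses only the flags shown above; the helper names build_inference_command and synthesize_batch and the test1, test2, ... naming scheme are assumptions, and the script is expected to run from the repository root.

```python
import subprocess
import sys


def build_inference_command(text, ref_wav_path, test_name, vocoder="HiFiGAN"):
    """Build the argument list for one inference.py call."""
    return [
        sys.executable, "inference.py",
        "--vocoder", vocoder,
        "--text", text,
        "--ref_wav_path", ref_wav_path,
        "--test_name", test_name,
    ]


def synthesize_batch(sentences, ref_wav_path, vocoder="HiFiGAN"):
    """Run inference.py once per sentence, naming the runs test1, test2, ..."""
    for index, text in enumerate(sentences, start=1):
        command = build_inference_command(text, ref_wav_path, f"test{index}", vocoder)
        # check=True stops the batch as soon as one run fails.
        subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with Persian text. A call such as synthesize_batch(["یک نمونه از خروجی"], "/path/to/sample/reference.wav") runs the same command as above once per sentence.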
Output samples synthesized by the trained model from this study (link to be updated) are available in this directory, along with the same utterances generated by two baseline models, the natural utterances, and utterances generated from gold spectrograms by the vocoder used in the study.