Grad-TTS

Official implementation of the Grad-TTS model, based on diffusion probabilistic modelling. For full details, see our paper, accepted to ICML 2021, via this link.

Authors: Vadim Popov*, Ivan Vovk*, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov.

*Equal contribution.

Abstract

Demo page with voiced abstract: link.

Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions while stochastic calculus has provided a unified point of view on these techniques allowing for flexible inference schemes. In this paper we introduce Grad-TTS, a novel text-to-speech model with score-based decoder producing mel-spectrograms by gradually transforming noise predicted by encoder and aligned with text input by means of Monotonic Alignment Search. The framework of stochastic differential equations helps us to generalize conventional diffusion probabilistic models to the case of reconstructing data from noise with different parameters and allows to make this reconstruction flexible by explicitly controlling trade-off between sound quality and inference speed. Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score.

Installation

Firstly, install all Python package requirements:

pip install -r requirements.txt

Secondly, build monotonic_align code (Cython):

cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..

Note: the code was tested on Python 3.6.9.

Inference

You can download Grad-TTS and HiFi-GAN checkpoints trained on LJSpeech* and Libri-TTS datasets (22kHz) from here.

*Note: we open-source 2 checkpoints of Grad-TTS trained on LJSpeech. They are the same model trained with different positional encoding scales: x1 ("grad-tts-old.pt", the ICML 2021 submission model) and x1000 ("grad-tts.pt"). To use the former, set params.pe_scale=1; to use the latter, set params.pe_scale=1000. The Libri-TTS checkpoint was trained with scale x1000.
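
The pairing between the two LJSpeech checkpoints and their positional encoding scale can be written down as a small lookup. The sketch below is illustrative only: `PE_SCALE_BY_CHECKPOINT` and `pe_scale_for` are hypothetical names that are not part of this repository, and the code assumes the checkpoints keep the file names given in the note above.

```python
from pathlib import Path

# Hypothetical lookup (not part of the repository): checkpoint file name
# -> positional encoding scale it was trained with, as stated in the note.
PE_SCALE_BY_CHECKPOINT = {
    "grad-tts-old.pt": 1,     # ICML 2021 submission model
    "grad-tts.pt": 1000,      # same model trained with scale x1000
}


def pe_scale_for(checkpoint_path):
    """Return the pe_scale value matching a known LJSpeech checkpoint.

    Only the file name is inspected, so any directory prefix is accepted.
    Unknown file names raise ValueError instead of guessing a scale.
    """
    name = Path(checkpoint_path).name
    if name not in PE_SCALE_BY_CHECKPOINT:
        raise ValueError(f"no known pe_scale for checkpoint {name!r}")
    return PE_SCALE_BY_CHECKPOINT[name]
```

The returned value is what you would assign to params.pe_scale before running inference. Other checkpoints are deliberately left out of the lookup because their file names are not stated here.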

Put the necessary Grad-TTS and HiFi-GAN checkpoints into the checkpts folder in the root Grad-TTS directory (note: you can change the default HiFi-GAN path in inference.py).

Create a text file with the sentences you want to synthesize, like resources/filelists/synthesis.txt.
For single-speaker inference set params.n_spks=1; for multispeaker (Libri-TTS) inference set params.n_spks=247.
Run the script inference.py, providing the path to the text file, the path to the Grad-TTS checkpoint, the number of iterations to use for reverse diffusion (default: 10), and a speaker id if you want to perform multispeaker inference:
```
python inference.py -f <your-text-file> -c <grad-tts-checkpoint> -t <number-of-timesteps> -s <speaker-id-if-multispeaker>
```
Check the folder called out for the generated audio files.
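
The steps above can also be scripted. The following is a minimal sketch, not part of the repository: `write_sentences` and `build_inference_command` are hypothetical helpers, and the one-sentence-per-line layout is an assumption that should be checked against resources/filelists/synthesis.txt. Only the `-f`, `-c`, `-t` and `-s` flags shown in the command above are used.

```python
import sys
from pathlib import Path


def write_sentences(path, sentences):
    """Write non-empty sentences to `path`, one per line (assumed layout).

    Returns the number of sentences written.
    """
    lines = [s.strip() for s in sentences if s.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def build_inference_command(text_file, checkpoint, timesteps=10, speaker_id=None):
    """Build the argument list for inference.py.

    `timesteps` is the number of reverse diffusion iterations (default: 10).
    `speaker_id` is only passed for multispeaker inference.
    """
    cmd = [
        sys.executable, "inference.py",
        "-f", str(text_file),
        "-c", str(checkpoint),
        "-t", str(timesteps),
    ]
    if speaker_id is not None:
        cmd += ["-s", str(speaker_id)]
    return cmd
```

From the repository root, the resulting list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.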

You can also perform interactive inference by running Jupyter Notebook inference.ipynb or by using our Google Colab Demo.

Training

Make filelists of your audio data like the ones included in the resources/filelists folder. For single-speaker training refer to the ljspeech filelists, and for multispeaker training refer to the libri-tts filelists.
Set the experiment configuration in params.py.
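
As a rough illustration of preparing a filelist, the sketch below assumes each entry is a pipe-separated line of the form `path|transcript`, with a trailing `|speaker_id` for multispeaker data. That layout is an assumption here and should be verified against the files in resources/filelists before use; `filelist_line` and `write_filelist` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path


def filelist_line(wav_path, transcript, speaker_id=None):
    """Format one filelist entry (assumed format: path|transcript[|speaker_id])."""
    text = transcript.strip()
    # The separator and newlines would corrupt the line-based format.
    if "|" in text or "\n" in text:
        raise ValueError("transcript must not contain '|' or newlines")
    fields = [str(wav_path), text]
    if speaker_id is not None:
        fields.append(str(int(speaker_id)))
    return "|".join(fields)


def write_filelist(path, entries):
    """Write entries to `path`, one per line.

    Each entry is (wav_path, transcript) for single-speaker data or
    (wav_path, transcript, speaker_id) for multispeaker data.
    """
    lines = [filelist_line(*entry) for entry in entries]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

For example, `filelist_line("wavs/a.wav", "Hello.", 3)` produces `wavs/a.wav|Hello.|3` under the assumed format.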

Specify your GPU device and run training script:

export CUDA_VISIBLE_DEVICES=YOUR_GPU_ID
python train.py  # if single speaker
python train_multi_speaker.py  # if multispeaker
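
If you prefer to launch training from Python, the choice between the two scripts can be derived from the number of speakers. This is a minimal sketch under that assumption: `training_invocation` is a hypothetical helper, not part of the repository, and it only reproduces the shell commands above.

```python
import os
import sys


def training_invocation(n_spks, gpu_id, base_env=None):
    """Return (command, environment) for launching training.

    n_spks == 1 selects train.py; any larger value selects
    train_multi_speaker.py. CUDA_VISIBLE_DEVICES is set to `gpu_id`
    in a copy of the environment, leaving the caller's untouched.
    """
    if n_spks < 1:
        raise ValueError("n_spks must be at least 1")
    script = "train.py" if n_spks == 1 else "train_multi_speaker.py"
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return [sys.executable, script], env
```

The pair can be passed on as `subprocess.run(cmd, env=env, check=True)` from the repository root.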

To track your training process, run a TensorBoard server on any available port:
```
tensorboard --logdir=YOUR_LOG_DIR --port=8888
```
During training, all logging information and checkpoints are stored in YOUR_LOG_DIR, which you can specify in params.py before training starts.

References

HiFi-GAN model is used as vocoder, official github repository: link.
Monotonic Alignment Search algorithm is used for unsupervised duration modelling, official github repository: link.
Phonemization utilizes CMUdict, official github repository: link.
