This project contains the TTS Data Engine, which automatically generates TTS training data from a YouTube URL.
requirements.txt
- I am not sure that the "pip install -r requirements.txt" command works as-is.
- Please use this file only as a reference when setting up your own environment.
-
models/embedding/nnet/model.pth
→"my custom embedding model; MFA-Conformer checkpoint path"
https://github.com/zyzisyz/mfa_conformer
This GitHub repository may be helpful for reproducing MFA-Conformer.
-
models/sc_models/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt
→visit "https://github.com/microsoft/unilm/tree/master/beats"
-
models/sqa_models/model_noresqa_mos.pth
→download from "https://github.com/facebookresearch/Noresqa/blob/main/models/model_noresqa_mos.pth"
Note that "git clone" does not fetch this file properly because the model is so large; download it directly.
-
models/sqa_models/wav2vec_small.pt
→visit "https://github.com/facebookresearch/fairseq/blob/main/examples/wav2vec/README.md" and download the "Wav2Vec 2.0 Base No finetuning" checkpoint
-
main.py
- Runs the pipeline and then saves the results.
- If you want to know the format of the output file, please refer to
Wb6Oc1_SdJw-ex.json
which was created from the URL https://www.youtube.com/watch?v=Wb6Oc1_SdJw
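As a quick sanity check on the output format, the example file can be loaded directly; nothing about the schema is assumed here beyond it being valid JSON.

```python
import json

# Load the example output created by main.py and show its top-level shape.
with open("Wb6Oc1_SdJw-ex.json", encoding="utf-8") as f:
    results = json.load(f)

if isinstance(results, dict):
    print(list(results.keys()))                    # top-level fields
else:
    print(type(results).__name__, len(results))    # e.g. a list of utterances
```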
-
src/url_loader.py
- Downloads the mp4, wav, and captions from a YouTube URL.
- Just uses the pytube package; a minimal sketch is shown below.
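A minimal sketch of the download step with pytube, assuming an audio-only stream and English captions; converting the downloaded audio to wav (e.g. with ffmpeg) is omitted, and the filenames are placeholders.

```python
from pytube import YouTube

url = "https://www.youtube.com/watch?v=Wb6Oc1_SdJw"
yt = YouTube(url)

# Highest-resolution progressive stream for the mp4.
yt.streams.get_highest_resolution().download(filename="video.mp4")

# Best audio-only stream; convert to wav separately (e.g. with ffmpeg).
yt.streams.filter(only_audio=True).order_by("abr").desc().first().download(
    filename="audio.mp4"
)

# English captions, if available ("en" is an assumed language code).
caption = yt.captions.get("en")
if caption is not None:
    with open("captions.srt", "w", encoding="utf-8") as f:
        f.write(caption.generate_srt_captions())
```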
-
src/diarize.py
- I just use the pyannote v2.1 framework; a minimal sketch is shown below.
- I created src/custom_pyannote to conduct some experiments individually, but nothing has changed significantly from the upstream code.
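A minimal sketch of running the pretrained pyannote 2.1 diarization pipeline; it assumes a Hugging Face access token and a 16 kHz mono wav.

```python
from pyannote.audio import Pipeline

# Pretrained v2.1 diarization pipeline (requires a Hugging Face access token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization@2.1",
    use_auth_token="YOUR_HF_TOKEN",
)

diarization = pipeline("audio.wav")

# Each track is a (segment, track_id, speaker_label) triple.
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.2f}s-{segment.end:.2f}s: {speaker}")
```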
-
src/asr.py
- I just use the whisper / whisperX frameworks; a rough sketch is shown below.
- I don't use the VAD module in WhisperX; I simply replace it with the diarization results.
- I run the word alignment, but I didn't use it in the final result.
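A rough sketch of the transcription and alignment steps; the whisperX API differs across versions (this follows the 3.x interface), and the model size is an assumption.

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")

# Transcribe with a Whisper model ("large-v2" is an assumed model size).
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio)

# Word-level alignment (computed in the pipeline, but unused in the output).
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)
print(aligned["segments"][0])
```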
-
src/collector.py
- Runs the Sound Classification and Speech Quality Assessment analysis on the found utterances.
- If the positions of the utterances are not given, they can be obtained on the fly with the VAD module of WhisperX.
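An illustrative sketch of the collector's control flow only; run_vad, classify_noise, and assess_quality are hypothetical stand-ins for the vad/classify/sqa modules described below, not the actual function names.

```python
# Hypothetical stand-ins for src/vad.py, src/classify.py, and src/sqa.py.
def run_vad(wav_path):                  # -> list of (start, end) in seconds
    raise NotImplementedError

def classify_noise(wav_path, segment):  # -> sound-classification labels
    raise NotImplementedError

def assess_quality(wav_path, segment):  # -> estimated MOS
    raise NotImplementedError

def collect(wav_path, segments=None):
    # Fall back to VAD when utterance positions are not given.
    if segments is None:
        segments = run_vad(wav_path)
    return [
        {
            "segment": seg,
            "sound_classes": classify_noise(wav_path, seg),
            "mos": assess_quality(wav_path, seg),
        }
        for seg in segments
    ]
```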
-
src/classify.py
- Estimates Sound Classification results to determine the acoustic noise in a given utterance.
- For Sound Classification, we use BEATs iter3+, fine-tuned on the AS-2M dataset; loading it is sketched below.
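Loading and running the checkpoint roughly follows the usage example in the BEATs repository README; the checkpoint path matches the file listed above, and the random waveform is just a stand-in for a real 16 kHz utterance.

```python
import torch
from BEATs import BEATs, BEATsConfig  # from the microsoft/unilm BEATs repo

# Load the fine-tuned checkpoint listed above.
ckpt = torch.load("models/sc_models/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt")
model = BEATs(BEATsConfig(ckpt["cfg"]))
model.load_state_dict(ckpt["model"])
model.eval()

# Stand-in 16 kHz waveform; the padding mask marks padded samples.
audio = torch.randn(1, 16000)
padding_mask = torch.zeros(1, 16000).bool()

probs = model.extract_features(audio, padding_mask=padding_mask)[0]

# Map the top predictions back to AudioSet labels.
top_probs, top_idx = probs.topk(k=5)
labels = [ckpt["label_dict"][i.item()] for i in top_idx[0]]
print(list(zip(labels, top_probs[0].tolist())))
```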
-
src/sqa.py
- Performs a subjective speech quality assessment for a given utterance.
- The NORESQA-MOS model is used for the Speech Quality Assessment.
- For the Non-Matching References (NMR), 100 clips from the DAPS clean set were randomly cut to 3 seconds each; a sketch of this step is shown below.
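A sketch of how the 3-second NMR clips could be cut from the DAPS clean recordings; the directory layout and sample rate are assumptions, and scoring with the NORESQA-MOS model itself follows the Noresqa repository.

```python
import glob
import random

import torchaudio

SR = 16000            # assumed sample rate
NMR_LEN = 3 * SR      # 3-second non-matching references

def sample_nmr_clips(daps_clean_dir, n=100):
    """Randomly cut n 3-second clips from the DAPS clean recordings."""
    paths = glob.glob(f"{daps_clean_dir}/*.wav")
    clips = []
    for path in random.choices(paths, k=n):
        wav, sr = torchaudio.load(path)
        wav = torchaudio.functional.resample(wav, sr, SR)
        start = random.randint(0, max(wav.shape[1] - NMR_LEN, 0))
        clips.append(wav[:, start:start + NMR_LEN])
    return clips
```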
-
src/vad.py
- Voice activity detection module.
- Uses the VAD module provided by whisperX; a version-dependent sketch is shown below.
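whisperX's VAD lives in an internal module, so treat the import and call below as assumptions tied to whisperX 3.x rather than a stable API.

```python
import torch
import whisperx
from whisperx.vad import load_vad_model  # internal helper; assumed from whisperX 3.x

SR = 16000
audio = whisperx.load_audio("audio.wav")  # float32 numpy array at 16 kHz
vad_model = load_vad_model("cuda")

# pyannote-style input dict; returns the detected speech regions.
speech = vad_model({
    "waveform": torch.from_numpy(audio).unsqueeze(0),
    "sample_rate": SR,
})
print(speech)
```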
-
src/enhance.py
- Speech Enhancement module using the FRCRN model, which achieved SOTA performance in the DNS Challenge.
- We modified
https://github.com/modelscope/modelscope/blob/203a565a3996aa79c80eca34c8409b4334867268/modelscope/pipelines/audio/ans_pipeline.py
into src/FRCRN/ans_pipeline.py
to boost the speed by decoding in batch mode.
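For reference, the stock (single-file) ModelScope pipeline can be invoked as below; the model ID is the public FRCRN checkpoint on ModelScope, and src/FRCRN/ans_pipeline.py replaces the per-file decoding with a batched version.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Stock ModelScope ANS pipeline built around the public FRCRN checkpoint.
ans = pipeline(
    Tasks.acoustic_noise_suppression,
    model="damo/speech_frcrn_ans_cirm_16k",
)

# Denoise a single file; the modified pipeline batches this step instead.
ans("noisy.wav", output_path="enhanced.wav")
```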
This project also builds on the following works, used in the various stages of the pipeline:
@article{bain2022whisperx,
title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
journal={arXiv preprint arXiv:2303.00747},
year={2023}
}
@article{radford2022robust,
title={Robust speech recognition via large-scale weak supervision},
author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
journal={arXiv preprint arXiv:2212.04356},
year={2022}
}
@article{baevski2020wav2vec,
title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
journal={Advances in neural information processing systems},
volume={33},
pages={12449--12460},
year={2020}
}
@inproceedings{bredin2020pyannote,
title={pyannote.audio: neural building blocks for speaker diarization},
author={Bredin, Herv{\'e} and Yin, Ruiqing and Coria, Juan Manuel and Gelly, Gregory and Korshunov, Pavel and Lavechin, Marvin and Fustes, Diego and Titeux, Hadrien and Bouaziz, Wassim and Gill, Marie-Philippe},
booktitle={ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={7124--7128},
year={2020},
organization={IEEE}
}
@inproceedings{noresqa,
title={{NORESQA}: A Framework for Speech Quality Assessment using Non-Matching References},
author={Pranay Manocha and Buye Xu and Anurag Kumar},
booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
year={2021},
url={https://proceedings.neurips.cc/paper/2021/file/bc6d753857fe3dd4275dff707dedf329-Paper.pdf}
}
@inproceedings{noresqamos,
title={Speech Quality Assessment through MOS using Non-Matching References},
author={Pranay Manocha and Anurag Kumar},
booktitle={Interspeech},
year={2022},
url={https://arxiv.org/abs/2206.12285}
}
@misc{kumar2023torchaudiosquim,
title={TorchAudio-Squim: Reference-less Speech Quality and Intelligibility measures in TorchAudio},
author={Anurag Kumar and Ke Tan and Zhaoheng Ni and Pranay Manocha and Xiaohui Zhang and Ethan Henderson and Buye Xu},
year={2023},
eprint={2304.01448},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
@misc{chen2022beats,
title={BEATs: Audio Pre-Training with Acoustic Tokenizers},
author={Sanyuan Chen and Yu Wu and Chengyi Wang and Shujie Liu and Daniel Tompkins and Zhuo Chen and Furu Wei},
year={2022},
eprint={2212.09058},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
@misc{zhao2022frcrn,
title={FRCRN: Boosting Feature Representation using Frequency Recurrence for Monaural Speech Enhancement},
author={Shengkui Zhao and Bin Ma and Karn N. Watcharasupat and Woon-Seng Gan},
year={2022},
eprint={2206.07293},
archivePrefix={arXiv},
primaryClass={cs.SD}
}