Trying to force-align a Japanese audiobook with the book's text.
The basic idea is to match each sentence in the book's text with its utterance in the audiobook. I am interested in doing so to easily get the audio corresponding to the Anki cards I mine from a book, or even to generate an Anki deck with both the text and audio for every sentence in the book.
As a simple benchmark, I try to perform the same alignment using each of the methods described below.
To install the required dependencies:
pip3 install -r requirements.txt
Code: test_aeneas.sh
aeneas is a Python/C library that performs forced alignment by synthesizing speech from the text and comparing this synthetic audio with the original audio, using the dynamic time warping (DTW) algorithm to compute an alignment between the two sequences.
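To make the DTW step more concrete, here is a minimal sketch of the textbook algorithm in pure Python. It is not aeneas's implementation: the function name `dtw_path` is made up for illustration, and toy scalar sequences stand in for the audio feature frames that a real aligner would compare.

```python
def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Return (total_cost, path) for a basic DTW between sequences a and b.

    Illustrative only: assumes both sequences are non-empty and uses the
    classic three-step recurrence with no windowing or step weights.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest way to align a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    # Walk back from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


synthetic = [1, 2, 3]        # stands in for the TTS audio frames
original = [1, 2, 2, 3]      # same "content", one frame held longer
total, path = dtw_path(synthetic, original)
print(total, path)
```

Each pair in the returned path maps a frame of the synthetic audio to a frame of the original audio. Since the synthetic audio is generated from known text, its sentence boundaries can be carried over to the original recording through this mapping.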
Using my furiganalyse tool, I converted the EPUB file's chapters to TXT files and took a sample to feed to aeneas.
Performance completely depends on the TTS (text-to-speech) engine used internally:
- AWS Polly: the alignment is perfect! Only a few silences remain at the beginning of some sentences, but those can easily be dealt with.
- espeak-ng (the open-source TTS suggested in the aeneas sample code): did not work at all; this might be due to the phonemes that are recognized.
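To connect the alignment back to the Anki use case, the sketch below turns a sync map into per-sentence ffmpeg cut commands and Anki import rows. It assumes the sync map was written in aeneas's JSON output format (a `fragments` list with `begin`, `end`, `id` and `lines`). The function name `sync_map_to_clips`, the clip naming scheme and the `padding` parameter are hypothetical and not part of this repository.

```python
import json


def sync_map_to_clips(sync_map_json, audio_path, padding=0.0):
    """Build ffmpeg commands and Anki TSV rows from an aeneas-style JSON sync map.

    Hypothetical helper: one clip per non-empty fragment, named after the
    fragment id. `padding` (seconds) widens each clip on both sides.
    """
    fragments = json.loads(sync_map_json)["fragments"]
    commands, rows = [], []
    for fragment in fragments:
        text = "".join(fragment["lines"]).strip()
        if not text:
            continue  # skip fragments that carry no text
        begin = max(0.0, float(fragment["begin"]) - padding)
        end = float(fragment["end"]) + padding
        clip_name = f"{fragment['id']}.mp3"
        # -ss / -to select the time range to extract from the audiobook.
        commands.append(["ffmpeg", "-i", audio_path,
                         "-ss", f"{begin:.3f}", "-to", f"{end:.3f}", clip_name])
        # Anki plays audio referenced with the [sound:...] syntax.
        rows.append(f"{text}\t[sound:{clip_name}]")
    return commands, rows


sample = json.dumps({"fragments": [
    {"begin": "0.000", "end": "2.680", "id": "f000001", "lines": ["吾輩は猫である。"]},
]})
commands, rows = sync_map_to_clips(sample, "book.mp3")
print(commands[0])
print(rows[0])
```

The commands are returned as argument lists rather than executed, so they can be inspected first and then passed to `subprocess.run`. The clips would still need to be copied into Anki's media folder for the `[sound:...]` references to resolve.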
Code: test_afaligner.py
Afaligner is a library for forced alignment of books with audiobooks. It uses aeneas under the hood, so it worked as well as standalone aeneas.
The only difference is that it uses fragments from an XHTML file as the sentences to align. I had to add those fragment ids to the XHTML, as the original files did not have them.
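As an illustration of adding fragment ids, here is a minimal sketch using only the standard library. It assumes one fragment per `<p>` element and an id scheme of the form `f0001`; both are assumptions made for this example, and the helper name `add_fragment_ids` is hypothetical rather than what test_afaligner.py does.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def add_fragment_ids(xhtml, tag="p", prefix="f"):
    """Give every <tag> element without an id a sequential id like f0001.

    Hypothetical helper: expects a well-formed XHTML string in the XHTML
    namespace and leaves existing ids untouched.
    """
    # Keep the XHTML namespace as the default one when serializing.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml)
    counter = 0
    for element in root.iter(f"{{{XHTML_NS}}}{tag}"):
        if element.get("id"):
            continue
        counter += 1
        element.set("id", f"{prefix}{counter:04d}")
    return ET.tostring(root, encoding="unicode")


doc = ('<html xmlns="http://www.w3.org/1999/xhtml"><body>'
       '<p>一文目。</p><p>二文目。</p></body></html>')
print(add_fragment_ids(doc))
```

Real EPUB chapters often put several sentences in one paragraph, so sentence-level fragments would need the text split further (for example by wrapping each sentence in its own element) before ids are assigned.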
Note: if you get "RuntimeError: Both the C extension and the pure Python code failed. (Wrong arguments? Input too big?)", it could be because you did not set up your AWS credentials.
Code: test_pyjuliusalign.py
pyjuliusalign is a library that performs forced alignment specifically for Japanese, using the Julius speech recognition engine. It takes the opposite approach to aeneas: it recognizes the phonemes in the audio (using Julius), then compares them to the phonemes extracted from the text to get the alignment.
Something is not working, need to debug it...