Some questions #55
This general area is what I'm primarily working on right now. Speech-to-transcript alignment attempts to find the approximate timing of the words and sentences of the transcript you give it within the speech audio you give it. Speech-to-translated-transcript alignment finds the approximate timing of the words and sentences of the translated transcript you give it within the speech audio. Currently only English is supported as the target language, since the Whisper translation task only supports English as a target. I'm currently working on a different, hybrid approach that would expand this to about 100 target languages other than English. There are actually several variants:
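To illustrate the *shape* of what alignment produces (this is not echogarden's actual algorithm or API — real forced alignment analyzes the audio; the sketch below just divides time proportionally to word length as a stand-in):

```python
def naive_word_timings(transcript: str, audio_duration: float) -> list[dict]:
    """Assign each word a time span proportional to its character length.

    This ignores the audio entirely, so it is NOT real forced alignment --
    but the output shape (word plus start/end times in seconds) is the kind
    of result speech-to-transcript alignment gives you.
    """
    words = transcript.split()
    total_chars = sum(len(w) for w in words)
    timings, cursor = [], 0.0
    for w in words:
        span = audio_duration * len(w) / total_chars
        timings.append({"word": w, "start": round(cursor, 3), "end": round(cursor + span, 3)})
        cursor += span
    return timings

# For a 2-second clip saying "hello world", each 5-letter word
# gets half the duration: hello 0.0-1.0 s, world 1.0-2.0 s.
```

In the translated-transcript variant, the words being timed come from a transcript in a *different* language than the audio, which is what makes the task harder.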
What you are describing sounds more like speech-to-translated-speech alignment: you want to find the mapping between the original speech and a translated (synthesized or dubbed) version of it. This is also possible. Here are some approaches for doing it:
Once the translated speech is aligned with the original, we can use a form of localized time stretching to fit the translated speech to the original, such that matching sentences, phrases, and words are synchronized with each other as closely as possible. The main issue is that the locally stretched speech would probably sound unnatural. Other than that, it's possible to do in the future. But getting things like speech-to-transcript translation and alignment, and also support for machine translation, is a higher priority, since something like this has to build on those features to be viable.
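The localized stretching step described above can be sketched as simple arithmetic over aligned segment pairs. This is a hypothetical helper, not part of echogarden's API; it assumes the aligner has already produced matching `(start, end)` spans for each segment in both recordings:

```python
def segment_stretch_factors(pairs: list[tuple[tuple[float, float], tuple[float, float]]]) -> list[float]:
    """For each (original, translated) pair of (start, end) spans in seconds,
    compute the playback-rate factor that fits the translated segment into
    its matching original slot.

    A factor > 1 means the translated segment must be compressed (played
    faster); a factor < 1 means it must be expanded (played slower).
    """
    factors = []
    for (o_start, o_end), (t_start, t_end) in pairs:
        original_len = o_end - o_start
        translated_len = t_end - t_start
        factors.append(translated_len / original_len)
    return factors

# A 2.5 s translated sentence that must fit a 2.0 s original slot
# needs a 1.25x speed-up for that segment only.
```

Applying a different factor per segment (rather than one global factor) is what keeps sentence boundaries synchronized, at the cost of the unnatural-sounding local speed variation mentioned above.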
I have quite a unique use case I think.
I have videos with a length of, let's say, 3 seconds that have an English voiceover. I have generated a Spanish voiceover using text-to-speech, and it is 5 seconds long. Just speeding up the Spanish audio will not work, since it would sound way too fast. Just slowing down the video also does not work; it would look way too slow.
Can I use any of the features of your tool to find the optimal adjustments between the video and the Spanish voiceover? What I mean is that I'd get something like "slow down the video by factor X and speed up the audio by factor Y", so that the adjustments are the least noticeable.
I am not sure if your tool supports something like this.
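For the specific numbers in the question, the split itself is just arithmetic and doesn't need the tool at all. One reasonable heuristic (an assumption, not a feature of echogarden) is to share the mismatch evenly in log space, so that the video slowdown and the audio speed-up are the same relative size:

```python
import math

def balanced_factors(video_len: float, audio_len: float) -> tuple[float, float]:
    """Split a duration mismatch evenly between slowing the video and
    speeding up the audio.

    Returns (video_slowdown, audio_speedup) such that
    video_len * video_slowdown == audio_len / audio_speedup,
    with both factors equal, so neither medium is changed
    disproportionately.
    """
    f = math.sqrt(audio_len / video_len)
    return f, f

# Example from the question: 3 s video, 5 s Spanish voiceover.
# Both factors come out to sqrt(5/3) ≈ 1.291, meeting at a common
# length of sqrt(3 * 5) ≈ 3.873 s.
video_factor, audio_factor = balanced_factors(3.0, 5.0)
```

Whether an even split is actually the *least noticeable* depends on content — viewers may tolerate video speed changes differently than audio pitch/tempo changes — so the exponent could be weighted toward one medium instead.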
And two additional questions:
Could you explain this one with an example? I am not entirely sure I understand what it does exactly:
Speech-to-translated-transcript alignment.
And for this one as well:
Speech-to-transcript alignment