Some questions #55
This general area is what I'm primarily working on right now. Speech-to-transcript alignment attempts to find the approximate timing of the words and sentences of the transcript you give it within the speech audio you give it. Speech-to-translated-transcript alignment finds the approximate timing of the words and sentences of the translated transcript you give it within the speech audio. Currently only English is supported as the target language, since the Whisper translation task only supports English as a target. I'm currently working on a different, hybrid approach that would expand this to about 100 target languages other than English. There are actually several variants:
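To illustrate the *shape* of what alignment produces (this is not echogarden's actual algorithm or API — real forced alignment analyzes the audio; the sketch below just divides time proportionally to word length as a stand-in):

```python
def naive_word_timings(transcript: str, audio_duration: float) -> list[dict]:
    """Assign each word a time span proportional to its character length.

    This ignores the audio entirely, so it is NOT real forced alignment --
    but the output shape (word plus start/end times in seconds) is the kind
    of result speech-to-transcript alignment gives you.
    """
    words = transcript.split()
    total_chars = sum(len(w) for w in words)
    timings, cursor = [], 0.0
    for w in words:
        span = audio_duration * len(w) / total_chars
        timings.append({"word": w, "start": round(cursor, 3), "end": round(cursor + span, 3)})
        cursor += span
    return timings

# For a 2-second clip saying "hello world", each 5-letter word
# gets half the duration: hello 0.0-1.0 s, world 1.0-2.0 s.
```

In the translated-transcript variant, the words being timed come from a transcript in a *different* language than the audio, which is what makes the task harder.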
What you are describing sounds more like speech-to-translated-speech alignment: you want to find the mapping between the original speech and a translated (synthesized or dubbed) version of it. This is also possible. Here are some approaches for doing it:
Once the translated speech is aligned with the original, we can use a form of localized time stretching to fit the translated speech to the original, such that matching sentences, phrases, and words are synchronized with each other as closely as possible. The main issue is that the locally stretched speech would probably sound unnatural. Other than that, it's possible to do in the future. But getting things like speech-to-transcript translation and alignment, and also support for machine translation, is a higher priority, since something like this has to build on those features to be viable.
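The localized stretching step described above can be sketched as simple arithmetic over aligned segment pairs. This is a hypothetical helper, not part of echogarden's API; it assumes the aligner has already produced matching `(start, end)` spans for each segment in both recordings:

```python
def segment_stretch_factors(pairs: list[tuple[tuple[float, float], tuple[float, float]]]) -> list[float]:
    """For each (original, translated) pair of (start, end) spans in seconds,
    compute the playback-rate factor that fits the translated segment into
    its matching original slot.

    A factor > 1 means the translated segment must be compressed (played
    faster); a factor < 1 means it must be expanded (played slower).
    """
    factors = []
    for (o_start, o_end), (t_start, t_end) in pairs:
        original_len = o_end - o_start
        translated_len = t_end - t_start
        factors.append(translated_len / original_len)
    return factors

# A 2.5 s translated sentence that must fit a 2.0 s original slot
# needs a 1.25x speed-up for that segment only.
```

Applying a different factor per segment (rather than one global factor) is what keeps sentence boundaries synchronized, at the cost of the unnatural-sounding local speed variation mentioned above.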
I have quite a unique use case I think.
I have videos with a length of, let's say, 3 seconds that have an English voiceover. I have generated a Spanish voiceover using text-to-speech, and it is 5 seconds long. Just speeding up the Spanish audio will not work, since it would sound way too fast. Just slowing down the video also does not work; it would look way too slow.
Can I use any of the features of your tool to find the optimal adjustments between the video and the Spanish voiceover? What I mean is that I'd get something like "slow down the video by factor X and speed up the audio by factor Y", so that the adjustments are the least noticeable.
I am not sure if your tool supports something like this.
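For the specific numbers in the question, the split itself is just arithmetic and doesn't need the tool at all. One reasonable heuristic (an assumption, not a feature of echogarden) is to share the mismatch evenly in log space, so that the video slowdown and the audio speed-up are the same relative size:

```python
import math

def balanced_factors(video_len: float, audio_len: float) -> tuple[float, float]:
    """Split a duration mismatch evenly between slowing the video and
    speeding up the audio.

    Returns (video_slowdown, audio_speedup) such that
    video_len * video_slowdown == audio_len / audio_speedup,
    with both factors equal, so neither medium is changed
    disproportionately.
    """
    f = math.sqrt(audio_len / video_len)
    return f, f

# Example from the question: 3 s video, 5 s Spanish voiceover.
# Both factors come out to sqrt(5/3) ≈ 1.291, meeting at a common
# length of sqrt(3 * 5) ≈ 3.873 s.
video_factor, audio_factor = balanced_factors(3.0, 5.0)
```

Whether an even split is actually the *least noticeable* depends on content — viewers may tolerate video speed changes differently than audio pitch/tempo changes — so the exponent could be weighted toward one medium instead.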
And two additional questions:
Could you explain this one with an example? I am not entirely sure I understand what it does exactly:
Speech-to-translated-transcript alignment.
And for this one as well:
Speech-to-transcript alignment