TL;DR: this repo documents progress in automatic dubbing science, applications, and related topics.
Briefly, automatic dubbing (AD) is the task of reconstructing audiovisual content with artificially synthesized audio and/or video in a language different from that of the original production. For instance, AD can be used to dub the movie Interstellar from its production language (English) into other languages like Italian, Chinese, or Amharic. This is what makes AD one of the most exciting applications of ML/AI. However, the science behind AD is still evolving. In an effort to track progress in AD, this repo documents scientific literature, tools, and other relevant materials.
In practice, AD works by machine translating the production-language speech (source) into the new dubbing language (target), prosodically aligning the target and source scripts, synthesizing the target speech, and finally generating the dubbed video. In AD, audiovisual coherence is indispensable: no one enjoys watching a show or movie with a timing mismatch between the video and audio streams, i.e., lip-sync error.
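The cascaded flow described above can be sketched as follows. This is a minimal illustration of how the stages hand off to each other; every function is a hypothetical stand-in (no real SR/MT/PA/TS model is invoked), and the segment format is an assumption.

```python
# Illustrative sketch of a cascaded AD flow. All functions are
# hypothetical stand-ins for real models, used only to show the data flow.

def transcribe(segment):
    # SR stand-in: would return the recognized source-language text.
    return segment["text"]

def translate(text, target_lang):
    # MT stand-in: would return the target-language translation.
    return f"<{target_lang}> {text}"

def prosodic_align(segment, target_text):
    # PA stand-in: keep the target utterance inside the source timing
    # window so audio and video stay in sync.
    return {"start": segment["start"], "end": segment["end"], "text": target_text}

def synthesize(aligned):
    # TS stand-in: would render speech whose duration fits the window.
    return {"duration": aligned["end"] - aligned["start"], "script": aligned["text"]}

def dub_segment(segment, target_lang):
    source_text = transcribe(segment)
    target_text = translate(source_text, target_lang)
    aligned = prosodic_align(segment, target_text)
    return synthesize(aligned)

# One source speech segment: 2.0 s of English speech.
segment = {"start": 10.0, "end": 12.0, "text": "We must leave Earth."}
print(dub_segment(segment, "it"))
```

The key design point the sketch highlights is that the source segment's timing window is threaded through PA into TS, which is what preserves audiovisual coherence.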
Note
- Organization - notes in this repo have three main sections: science, application, and related topics.
- Materials - are directly or indirectly useful for AD or related use cases such as automatic voice-over, lip-syncing, or subtitling.
- Missing item - if you know of a paper that is not documented here, please open a PR or send the info over email. Thanks!
The science section is organized after the AD pipeline, covering the Automatic Speech Recognition (SR), Machine Translation (MT), Prosodic Alignment (PA), and Text-to-Speech (TS) modules.
The Two Shades of Dubbing in Neural Machine Translation, COLING, 2022
- Proposes applying a relaxed translation length constraint for off-screen speech (i.e., when a character's lips are not visible on screen), to minimize the impact of AD on translation quality.
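The idea above can be sketched as a length check that is strict only when lips are visible. This is a hedged illustration of the general technique, not the paper's method; the word-ratio metric and tolerance values are assumptions.

```python
# Sketch of an on/off-screen length constraint: enforce a tight
# target/source length ratio for on-screen speech, relax it off-screen.
# The word-count ratio and tolerances are illustrative assumptions.

def acceptable_length(source: str, target: str, on_screen: bool,
                      tight: float = 0.1, relaxed: float = 0.4) -> bool:
    ratio = len(target.split()) / max(1, len(source.split()))
    tolerance = tight if on_screen else relaxed
    return abs(ratio - 1.0) <= tolerance

# A translation 25% shorter than the source fails on-screen but passes off-screen.
print(acceptable_length("we must leave earth now", "dobbiamo partire ora", True))
print(acceptable_length("we must leave earth now", "dobbiamo partire ora", False))
```

Relaxing the constraint off-screen lets the MT model pick a more natural translation when no lip-sync penalty applies.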
Cascaded and end-to-end approaches are both considered for realizing AD. Cascaded AD is a pipeline of independently trained and managed models such as SR, MT, PA, and TS. In contrast, end-to-end AD uses a single model to accomplish all of the tasks in the cascaded pipeline.
From Speech-to-Speech Translation to Automatic Dubbing, IWSLT, 2020
- Proposes to create an AD experience by combining robust SR, MT that controls output length, PA to align source speech segments with MT outputs, and TS with a feature to adjust the duration of speech segments.
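The duration-adjustment feature mentioned above can be sketched as a speaking-rate factor that squeezes or stretches the synthesized utterance into the source timing window. This is an illustrative sketch only; the clamping range that keeps speech natural-sounding is an assumption, not a value from the paper.

```python
# Sketch of TS duration adjustment: compute how much faster or slower
# the synthesized speech must play to fit the source segment window.
# The [0.8, 1.25] naturalness bounds are illustrative assumptions.

def speaking_rate_factor(synth_duration: float, window: float,
                         lo: float = 0.8, hi: float = 1.25) -> float:
    factor = synth_duration / window
    # Clamp so the adjusted speech does not sound unnaturally fast or slow.
    return min(hi, max(lo, factor))

# A 2.4 s synthesis must fit a 2.0 s window: play 1.2x faster.
print(speaking_rate_factor(2.4, 2.0))
```

When the required factor falls outside the clamp, a real system would instead ask the MT stage for a shorter (or longer) translation, which is exactly why length-controlled MT and PA appear earlier in this pipeline.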
Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos, arXiv, 2022
- Focuses on synthetic lip/face generation for lip-synced AD, using SR with word-emphasis detection, MT with emphasis-transfer capability, and TS with a voice-conversion module.
This section is dedicated to documenting materials such as books, articles, tools, and relevant analyses of AD.