
StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing


Abstract

Given a script, the challenge in Movie Dubbing (Visual Voice Cloning, V2C) is to generate speech that aligns well with the video in both time and emotion, based on the tone of a reference audio track. Existing state-of-the-art V2C models break the phonemes in the script according to the divisions between video frames, which solves the temporal alignment problem but leads to incomplete phoneme pronunciation and poor identity stability. To address this problem, we propose StyleDubber, which switches dubbing learning from the frame level to the phoneme level. It contains three main components: (1) a multimodal style adaptor operating at the phoneme level to learn pronunciation style from the reference audio and generate intermediate representations informed by the facial emotion presented in the video; (2) an utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve the overall style expression; and (3) a phoneme-guided lip aligner to maintain lip sync. Extensive experiments on two of the primary benchmarks, V2C and Grid, demonstrate the favorable performance of the proposed method as compared to the current state-of-the-art. The code will be made available here.
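To make the frame-level versus phoneme-level distinction concrete, here is a minimal sketch of phoneme-level length allocation. It is not StyleDubber's lip aligner, which is a learned module; it only shows how each phoneme can be kept whole while the total speech length stays fixed by the video. The function name `allocate_mel_frames`, the per-phoneme `weights`, and the ratio of 4 mel frames per video frame are all illustrative assumptions, not values taken from the paper.

```python
def allocate_mel_frames(weights, num_video_frames, mel_per_video_frame=4):
    """Split a fixed mel-frame budget across whole phonemes.

    weights: hypothetical positive relative durations, one per phoneme.
    num_video_frames: number of frames in the clip being dubbed.
    mel_per_video_frame: assumed fixed ratio between mel and video frames.
    """
    n = len(weights)
    if n == 0 or any(w <= 0 for w in weights):
        raise ValueError("weights must be non-empty and strictly positive")

    total = num_video_frames * mel_per_video_frame
    if total < n:
        raise ValueError("not enough mel frames to give each phoneme one frame")

    # Reserve one frame per phoneme so that no phoneme is dropped.
    spare = total - n
    weight_sum = sum(weights)
    shares = [spare * w / weight_sum for w in weights]
    durations = [1 + int(s) for s in shares]

    # Hand out the frames lost to rounding, largest fractional part first,
    # so the durations sum exactly to the video-derived budget.
    leftover = total - sum(durations)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        durations[i] += 1
    return durations


if __name__ == "__main__":
    # Three phonemes, a two-frame clip, and an assumed budget of 8 mel frames.
    print(allocate_mel_frames([1, 2, 1], num_video_frames=2))
```

Each phoneme receives a contiguous, non-empty span, and the spans together fill the clip exactly. A frame-level scheme would instead cut the phoneme sequence at video-frame boundaries.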

Demos

Result of Dubbing Setting1

Result of Dubbing Setting2

Result of Dubbing Setting3

The V2C Animation Setting1 Results

Text: “Yes, I’m the baby Jesus”

FastSpeech2
StyleSpeech
FaceTTS
Zeroshot-TTS
V2C-Net
HPMDubbing
Our StyleDubber
GT

The GRID Setting2 Results

Text: “place red with m eight now”

Reference

FastSpeech2
StyleSpeech
FaceTTS
Zeroshot-TTS
V2C-Net
HPMDubbing
Our StyleDubber
GT

The V2C Animation Setting2 Results

Text: “You are not responsible for their choices, Elsa.”

FastSpeech2
StyleSpeech
FaceTTS
Zeroshot-TTS
V2C-Net
HPMDubbing
Our StyleDubber
GT

The V2C Animation Setting3 Results (Male voice actors dubbing female characters)

Text: “It’s a lot of responsibility.”

Raw Dubbing Video

Reference

FaceTTS
Zeroshot-TTS
V2C-Net
FastSpeech2
Our StyleDubber
HPMDubbing

The V2C Animation Setting3 Results (Male voice actors dubbing male characters)

Text: “I can’t help. I can’t help anyone.”

Raw Dubbing Video

Reference

FaceTTS
Zeroshot-TTS
V2C-Net
FastSpeech2
Our StyleDubber
HPMDubbing

The V2C Animation Setting3 Results (Female voice actors dubbing male characters)

Text: “I thought you would understand.”

Raw Dubbing Video

Reference

FaceTTS
Zeroshot-TTS
V2C-Net
FastSpeech2
Our StyleDubber
HPMDubbing