Paper Title

Learning to Dub Movies via Hierarchical Prosody Models

Authors

Gaoxiang Cong, Liang Li, Yuankai Qi, Zhengjun Zha, Qi Wu, Wenyu Wang, Bin Jiang, Ming-Hsuan Yang, Qingming Huang

Abstract

Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference. V2C is more challenging than conventional text-to-speech tasks because it additionally requires the generated speech to exactly match the varying emotions and speaking speed presented in the video. Unlike previous works, we propose a novel movie dubbing architecture that tackles these problems via hierarchical prosody modelling, bridging the visual information to the corresponding speech prosody from three aspects: lip, face, and scene. Specifically, we align lip movement to the speech duration, and convey facial expression to speech energy and pitch via an attention mechanism based on valence and arousal representations, inspired by recent psychology findings. Moreover, we design an emotion booster to capture the atmosphere from global video scenes. All these embeddings are used together to generate a mel-spectrogram, which is then converted to a speech waveform via an existing vocoder. Extensive experimental results on the Chem and V2C benchmark datasets demonstrate the favorable performance of the proposed method. The source code and trained models will be released to the public.
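
To make the lip/face/scene hierarchy concrete, below is a minimal PyTorch sketch of the data flow the abstract describes: lip motion drives phoneme durations, face features drive frame-level energy and pitch via attention, and a global scene "emotion booster" conditions the mel decoder. Module names, dimensions, and the additive fusion scheme are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the hierarchical prosody dubbing pipeline.
# Not the authors' code; names and fusion choices are assumptions.
import torch
import torch.nn as nn


class HierarchicalProsodyDubber(nn.Module):
    def __init__(self, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        # Lip branch: attend phoneme states to lip-motion features,
        # then predict a per-phoneme duration (log scale).
        self.lip_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.duration_predictor = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )
        # Face branch: attend frame states to valence/arousal face features
        # and predict frame-level energy and pitch.
        self.face_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.energy_head = nn.Linear(d_model, 1)
        self.pitch_head = nn.Linear(d_model, 1)
        # Scene branch: "emotion booster" summarizing the global scene clip.
        self.scene_booster = nn.GRU(d_model, d_model, batch_first=True)
        # Mel decoder: maps fused frame states to a mel-spectrogram,
        # which a pretrained vocoder would turn into a waveform.
        self.mel_decoder = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_mels)
        )

    def forward(self, phoneme_emb, speaker_emb, lip_feat, face_feat, scene_feat):
        # phoneme_emb: (1, T_text, D); speaker_emb: (1, D);
        # lip/face/scene features: (1, T_lip, D), (1, T_face, D), (1, T_scene, D).
        h = phoneme_emb + speaker_emb.unsqueeze(1)  # condition on the reference voice

        # 1) Lip -> duration: expand phoneme states to frame level.
        lip_ctx, _ = self.lip_attn(h, lip_feat, lip_feat)
        log_dur = self.duration_predictor(h + lip_ctx).squeeze(-1)
        dur = torch.clamp(torch.round(torch.exp(log_dur)), min=1).long()
        frames = torch.repeat_interleave(h + lip_ctx, dur[0], dim=1)  # batch size 1 for brevity

        # 2) Face -> energy/pitch via attention over valence/arousal features.
        face_ctx, _ = self.face_attn(frames, face_feat, face_feat)
        energy = self.energy_head(face_ctx)          # (1, T_frames, 1)
        pitch = self.pitch_head(face_ctx)
        frames = frames + face_ctx + energy + pitch  # broadcast over the feature dim

        # 3) Scene -> global emotion embedding broadcast over all frames.
        _, scene_emb = self.scene_booster(scene_feat)  # (1, 1, D)
        frames = frames + scene_emb.transpose(0, 1)

        # 4) Decode the mel-spectrogram.
        return self.mel_decoder(frames)              # (1, T_frames, n_mels)


if __name__ == "__main__":
    model = HierarchicalProsodyDubber()
    mel = model(
        phoneme_emb=torch.randn(1, 20, 256),
        speaker_emb=torch.randn(1, 256),
        lip_feat=torch.randn(1, 50, 256),
        face_feat=torch.randn(1, 50, 256),
        scene_feat=torch.randn(1, 30, 256),
    )
    print(mel.shape)  # (1, T_frames, 80)
```

In a trained system, the duration, energy, and pitch predictors would be supervised against targets extracted from the reference audio, and the predicted mel-spectrogram would be converted to a waveform by an off-the-shelf neural vocoder, as the abstract notes.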
