Paper Title


Towards Realistic Visual Dubbing with Heterogeneous Sources

Paper Authors

Tianyi Xie, Liucheng Liao, Cheng Bi, Benlai Tang, Xiang Yin, Jianfei Yang, Mingjie Wang, Jiali Yao, Yang Zhang, Zejun Ma

Paper Abstract


The task of few-shot visual dubbing focuses on synchronizing lip movements with arbitrary speech input for any talking-head video. Despite moderate improvements in current approaches, they commonly require high-quality homologous video and audio sources, and thus fail to sufficiently leverage heterogeneous data. In practice, it may be intractable to collect perfect homologous data in some cases, for example, audio-corrupted or picture-blurry videos. To explore this kind of data and support high-fidelity few-shot visual dubbing, in this paper we propose a simple yet efficient two-stage framework with higher flexibility for mining heterogeneous data. Specifically, our two-stage paradigm employs facial landmarks as an intermediate prior of latent representations and disentangles lip-movement prediction from the core task of realistic talking-head generation. In this way, our method makes it possible to train the two-stage sub-networks independently on more readily available heterogeneous data. Besides, thanks to the disentanglement, our framework allows further fine-tuning for a given talking head, leading to better speaker-identity preservation in the final synthesized results. Moreover, the proposed method can also transfer appearance features from other speakers to the target speaker. Extensive experimental results demonstrate the superiority of our proposed method over the state of the art in generating highly realistic videos synchronized with the speech.
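The abstract describes a two-stage pipeline in which facial landmarks act as the interface between audio-driven lip-movement prediction and photo-realistic frame rendering. The sketch below is only a rough illustration of that decomposition; all module names, layer choices, and tensor shapes are assumptions for exposition and are not taken from the paper's implementation.

```python
# Minimal sketch of the two-stage paradigm described in the abstract.
# Module names, dimensions, and layers are illustrative assumptions,
# not the authors' architecture.
import torch
import torch.nn as nn

class AudioToLandmarks(nn.Module):
    """Stage 1: predict facial-landmark sequences from speech features."""
    def __init__(self, audio_dim=80, landmark_points=68, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, landmark_points * 2)  # (x, y) per point

    def forward(self, audio_feats):            # (B, T, audio_dim)
        h, _ = self.encoder(audio_feats)
        return self.head(h)                    # (B, T, landmark_points * 2)

class LandmarksToFrames(nn.Module):
    """Stage 2: render talking-head frames from a rasterized landmark map
    plus a reference frame of the target speaker."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1, hidden, 3, padding=1), nn.ReLU(),  # RGB ref + 1-ch landmark map
            nn.Conv2d(hidden, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, landmark_map, ref_frame):  # (B, 1, H, W), (B, 3, H, W)
        return self.net(torch.cat([ref_frame, landmark_map], dim=1))

# Because the two stages only communicate through the landmark prior,
# each sub-network can be trained on its own (possibly heterogeneous)
# corpus and later fine-tuned for a specific talking head.
```

The point of the sketch is the interface, not the networks: keeping landmarks as the only shared representation is what lets the audio-to-landmark and landmark-to-frame models draw on different, imperfect data sources.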
