Paper Title

Stochastic Talking Face Generation Using Latent Distribution Matching

Paper Authors

Ravindra Yadav, Ashish Sardana, Vinay P. Namboodiri, Rajesh M. Hegde

Paper Abstract

The ability to envision a talking face based solely on hearing a voice is a uniquely human capability. A number of recent works have addressed this ability, but they generate only a single output per input. We differ from these approaches by enabling a variety of talking face generations from a single audio input; indeed, a system that can produce only one talking face is almost robotic in nature. In contrast, we present an unsupervised stochastic audio-to-video generation model that captures multiple modes of the video distribution, allowing diverse generations from a single audio input while ensuring that all of them remain plausible. We do so through a principled multi-modal variational autoencoder framework. We demonstrate its efficacy on the challenging LRW and GRID datasets, where it outperforms the baseline while retaining the ability to generate multiple diverse lip-synchronized videos.
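The abstract hinges on matching two latent distributions: a posterior inferred from both audio and video during training, and an audio-conditioned prior used alone at test time, so that sampling different latents from the prior yields diverse yet plausible frames for the same audio. The PyTorch sketch below illustrates this conditional-VAE mechanism in a minimal form; the `AudioToVideoVAE` class, all layer sizes, and the KL weight are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of audio-conditioned stochastic generation via latent
# distribution matching. Everything here (module names, sizes, loss weight)
# is an assumption for illustration, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioToVideoVAE(nn.Module):
    def __init__(self, audio_dim=128, frame_dim=64 * 64, latent_dim=32):
        super().__init__()
        # Prior network: predicts a latent distribution from audio alone.
        self.prior = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.prior_mu = nn.Linear(256, latent_dim)
        self.prior_logvar = nn.Linear(256, latent_dim)
        # Posterior network: sees both audio and the target frame (training only).
        self.post = nn.Sequential(nn.Linear(audio_dim + frame_dim, 256), nn.ReLU())
        self.post_mu = nn.Linear(256, latent_dim)
        self.post_logvar = nn.Linear(256, latent_dim)
        # Decoder: reconstructs a frame from audio features plus a latent sample.
        self.decoder = nn.Sequential(
            nn.Linear(audio_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, frame_dim), nn.Sigmoid(),
        )

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps, so gradients flow through mu and logvar.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, audio, frame):
        h_q = self.post(torch.cat([audio, frame], dim=-1))
        mu_q, logvar_q = self.post_mu(h_q), self.post_logvar(h_q)
        h_p = self.prior(audio)
        mu_p, logvar_p = self.prior_mu(h_p), self.prior_logvar(h_p)
        z = self.reparameterize(mu_q, logvar_q)
        recon = self.decoder(torch.cat([audio, z], dim=-1))
        # Closed-form KL(q || p) between two diagonal Gaussians: this is the
        # "latent distribution matching" term that aligns the posterior with
        # the audio-conditioned prior.
        kl = 0.5 * (
            logvar_p - logvar_q
            + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
            - 1.0
        ).sum(dim=-1).mean()
        loss = F.mse_loss(recon, frame) + 0.1 * kl  # 0.1: illustrative weight
        return recon, loss

    @torch.no_grad()
    def sample(self, audio, n=3):
        # At test time only audio is available: draw several latents from the
        # prior to obtain diverse generations for the same input.
        h = self.prior(audio)
        mu, logvar = self.prior_mu(h), self.prior_logvar(h)
        return [
            self.decoder(torch.cat([audio, self.reparameterize(mu, logvar)], dim=-1))
            for _ in range(n)
        ]

model = AudioToVideoVAE()
audio = torch.randn(8, 128)      # batch of audio feature vectors
frame = torch.rand(8, 64 * 64)   # batch of flattened target frames
recon, loss = model(audio, frame)
samples = model.sample(audio)    # three diverse generations per audio clip
```

Under this reading, the KL term pulls the audio-conditioned prior toward posteriors seen during training, so latents drawn from the prior at test time land in regions the decoder maps to plausible frames; that is what would keep the diverse samples lip-synchronized rather than arbitrary.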
