Paper Title
From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech
Paper Authors
Paper Abstract
This work explores the possibility of generating a human face from a voice based solely on audio-visual data, without any human-labeled annotations. To this end, we propose a multi-modal learning framework that links an inference stage and a generation stage. First, inference networks are trained to match speaker identity across the two modalities. Then the trained inference networks cooperate with a generation network by providing conditional information about the voice. The proposed method exploits recent developments in GAN techniques and generates a human face directly from the speech waveform, making our system fully end-to-end. We analyze the extent to which the network can naturally disentangle two latent factors that contribute to the generation of a face image, one derived directly from the speech signal and one unrelated to it, and explore whether the network can learn to model the natural distribution of human face images through these factors. Experimental results show that the proposed network can not only match the relationship between a human face and speech, but can also generate high-quality face samples conditioned on speech. Finally, the correlation between the generated faces and the corresponding speech is quantitatively measured to analyze the relationship between the two modalities.
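The pipeline described above (a speech-identity encoder whose embedding conditions a face generator alongside an unrelated noise factor) can be sketched as follows. This is a minimal NumPy illustration of the data flow only, not the paper's architecture; all dimensions, weight shapes, and function names (`speech_encoder`, `generator`) are hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def speech_encoder(waveform, W_enc):
    """Stand-in for the trained inference network: maps a raw speech
    waveform to a fixed-size identity embedding."""
    pooled = waveform.reshape(-1, 160).mean(axis=1)  # crude frame pooling
    return np.tanh(W_enc @ pooled)

def generator(speech_emb, noise, W_gen):
    """Conditional generator: concatenates the speech-derived factor
    with an unrelated noise factor and maps them to an image."""
    z = np.concatenate([speech_emb, noise])
    return np.tanh(W_gen @ z).reshape(8, 8)  # tiny 8x8 placeholder "face"

# Hypothetical dimensions (not from the paper).
wave = rng.standard_normal(16000)                 # 1 s of audio at 16 kHz
W_enc = rng.standard_normal((64, 100)) * 0.1      # encoder projection
W_gen = rng.standard_normal((64, 64 + 32)) * 0.1  # generator projection

emb = speech_encoder(wave, W_enc)                 # speech-conditioned factor
face = generator(emb, rng.standard_normal(32), W_gen)
print(face.shape)  # (8, 8)
```

In the actual system both networks would be deep and trained adversarially; the sketch only shows how the speech-derived embedding and the independent noise vector enter the generator as the two latent factors the abstract refers to.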