Paper Title
Learning Speaker-specific Lip-to-Speech Generation
Paper Authors
Paper Abstract
Understanding lip movements and inferring speech from them is notoriously difficult for most people. Accurate lip reading draws on various cues from the speaker and from the contextual or environmental setting. Every speaker has a different accent and speaking style, which can be inferred from their visual and speech features. This work aims to understand the correlation/mapping between speech and the sequence of lip movements of individual speakers in an unconstrained, large-vocabulary setting. We model the frame sequence as a prior to the transformer in an auto-encoder setting and learn a joint embedding that exploits the temporal properties of both audio and video. We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with the input lip movements. The predictive posterior thus gives us the generated speech in the speaker's speaking style. We have trained our model on the GRID and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation from lip movement in an unconstrained natural setting. Extensive evaluation using various qualitative and quantitative metrics, together with human evaluation, shows that our method outperforms the state of the art on the Lip2Wav Chemistry dataset (large vocabulary in an unconstrained setting) by a good margin across almost all evaluation metrics, and marginally outperforms the state of the art on the GRID dataset.
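The abstract does not say which metric-learning objective is used for temporal synchronization. As a rough illustration only, the sketch below uses the classic pairwise contrastive loss, which is an assumption and not necessarily the authors' choice. The function names, the margin value, and the toy two-dimensional embeddings are all hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embeddings of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_sync_loss(
    video_emb: Sequence[float],
    audio_emb: Sequence[float],
    is_synced: bool,
    margin: float = 1.0,  # hypothetical value, not taken from the paper
) -> float:
    """Pairwise contrastive loss (assumed objective, for illustration).

    Synchronized pairs are pulled together (loss = d^2).
    Out-of-sync pairs are pushed at least `margin` apart
    (loss = max(0, margin - d)^2).
    """
    d = euclidean_distance(video_emb, audio_emb)
    if is_synced:
        return d ** 2
    return max(0.0, margin - d) ** 2


if __name__ == "__main__":
    lip_window = [0.0, 1.0]      # toy embedding of a lip-frame window
    audio_aligned = [0.0, 1.0]   # toy embedding of the matching audio window
    audio_shifted = [0.0, 0.5]   # toy embedding of a time-shifted audio window

    print(contrastive_sync_loss(lip_window, audio_aligned, is_synced=True))
    print(contrastive_sync_loss(lip_window, audio_shifted, is_synced=False))
```

Under this assumed objective, an aligned pair with identical embeddings incurs no loss, while a shifted pair that sits closer than the margin is penalized. In a training setup, a term of this kind would be what encourages the decoder to produce speech aligned with the input lip movements.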