Visual Speech Enhancement Without A Real Visual Stream

Paper Title

Visual Speech Enhancement Without A Real Visual Stream

Authors

Hegde, Sindhu B; Prajwal, K R; Mukhopadhyay, Rudrabha; Namboodiri, Vinay; Jawahar, C. V.

Abstract

In this work, we rethink the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream, and their performance is limited across the wide range of real-world noises. Recent works that use lip movements as additional cues improve the quality of the generated speech over "audio-only" methods. However, these methods cannot be used in the several applications where the visual stream is unreliable or completely absent. We propose a new paradigm for speech enhancement that exploits recent breakthroughs in speech-driven lip synthesis. Using one such model as a teacher network, we train a robust student network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter". The intelligibility of the speech enhanced by our pseudo-lip approach is comparable (< 3% difference) to that obtained using real lips. This implies that we can exploit the advantages of lip movements even in the absence of a real video stream. We rigorously evaluate our model using quantitative metrics as well as human evaluations. Additional ablation studies and a demo video containing qualitative comparisons and results, both available on our website, clearly illustrate the effectiveness of our approach (http://cvit.iiit.ac.in/research/projects/cvit-projects/visual-speech-enhancement-without-a-real-visual-stream). The code and models are also released for future research (https://github.com/Sindhu-Hegde/pseudo-visual-speech-denoising).
