Paper Title

Audio-Visual Speech Inpainting with Deep Learning

Paper Authors

Giovanni Morrone, Daniel Michelsanti, Zheng-Hua Tan, Jesper Jensen

Paper Abstract

In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which show highly different structure than speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different duration. We also experiment with a multi-task learning approach where a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large, while the proposed audio-visual approach is able to plausibly restore missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.
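
To make the multi-task setup concrete, below is a minimal sketch of how an audio-visual inpainting network with an auxiliary phone recognition head could be wired up. It is an illustration under assumptions, not the authors' actual architecture: the feature dimensions, the BLSTM backbone, the use of a masked MSE reconstruction loss plus a CTC phone loss, and the weight `alpha` are all hypothetical choices for clarity; the abstract only states that phone recognition is learned jointly with inpainting.

```python
import torch
import torch.nn as nn

class AVInpaintingNet(nn.Module):
    """Illustrative sketch: a BLSTM over concatenated audio and visual
    features, with one head for spectrogram inpainting and one for
    phone recognition. Dimensions are hypothetical."""
    def __init__(self, audio_dim=257, visual_dim=136, hidden=256, n_phones=40):
        super().__init__()
        self.blstm = nn.LSTM(audio_dim + visual_dim, hidden,
                             num_layers=2, batch_first=True,
                             bidirectional=True)
        self.inpaint_head = nn.Linear(2 * hidden, audio_dim)   # restored frames
        self.phone_head = nn.Linear(2 * hidden, n_phones + 1)  # +1 for CTC blank

    def forward(self, audio, visual):
        # audio: (B, T, audio_dim), visual: (B, T, visual_dim)
        h, _ = self.blstm(torch.cat([audio, visual], dim=-1))
        return self.inpaint_head(h), self.phone_head(h).log_softmax(-1)

def multitask_loss(pred_spec, target_spec, gap_mask, log_probs,
                   phone_targets, in_lens, tgt_lens, alpha=0.5):
    """Reconstruction loss over the masked gap plus a CTC phone loss.
    gap_mask is 1 on frames inside the gap, 0 elsewhere."""
    mse = ((pred_spec - target_spec) ** 2 * gap_mask).sum() / gap_mask.sum()
    ctc = nn.functional.ctc_loss(log_probs.transpose(0, 1),  # (T, B, C) for CTC
                                 phone_targets, in_lens, tgt_lens)
    return mse + alpha * ctc
```

In a sketch like this, the visual stream stays uncorrupted across the gap, so the BLSTM can propagate lip-movement information into the masked region; that is one plausible mechanism for the abstract's finding that vision contributes most when gaps grow long.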
