Paper Title

Looking Enhances Listening: Recovering Missing Speech Using Images

Paper Authors

Tejas Srinivasan, Ramon Sanabria, Florian Metze

Paper Abstract

Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems. Current work, however, has shown that visually adapted ASR models only use images as a regularization signal, while completely ignoring their semantic content. In this paper, we present a set of experiments where we show the utility of the visual modality under noisy conditions. Our results show that multimodal ASR models can recover words that are masked in the input acoustic signal by grounding their transcriptions in the visual representations. We observe that integrating visual context can result in up to a 35% relative improvement in masked word recovery. These results demonstrate that end-to-end multimodal ASR systems can become more robust to noise by leveraging the visual context.
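To make the setup concrete, below is a minimal sketch (not the paper's implementation) of the two ingredients the abstract describes: masking one word's frames in the acoustic input, and conditioning an end-to-end ASR decoder on a global image embedding so the transcription is visually grounded. The function and class names (`mask_word_frames`, `MultimodalASR`), the zero-masking scheme, the pooled acoustic context, and all dimensions are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

def mask_word_frames(features, start, end):
    """Mask the frames of one word, given a word-level alignment.

    features: (T, D) acoustic features; start/end: frame indices of the word.
    Zeroing the frames is one simple stand-in for acoustic noise masking.
    """
    masked = features.clone()
    masked[start:end] = 0.0
    return masked

class MultimodalASR(nn.Module):
    """Toy encoder-decoder ASR whose decoder state is initialized from an
    image embedding -- one common way to ground transcriptions visually."""

    def __init__(self, feat_dim=40, img_dim=2048, hidden=256, vocab=1000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)   # project image features
        self.decoder = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats, img_feat, max_len=20):
        enc, _ = self.encoder(feats)                 # (B, T, H)
        ctx = enc.mean(dim=1)                        # crude pooled acoustic context
        h = torch.tanh(self.img_proj(img_feat))      # visual init of decoder state
        logits = []
        for _ in range(max_len):
            h = self.decoder(ctx, h)                 # fuse audio context each step
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)            # (B, max_len, vocab)

# Usage: mask one word's frames, then decode with the image available.
feats = torch.randn(1, 100, 40)                      # 100 frames of features
feats[0] = mask_word_frames(feats[0], start=40, end=55)
img = torch.randn(1, 2048)                           # e.g. a CNN image embedding
logits = MultimodalASR()(feats, img)                 # (1, 20, 1000)
```

The intuition the sketch captures is that when the acoustic evidence for a word is silenced, the decoder's only route to recovering it is the visual conditioning signal.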
