Paper Title

Listen, Look and Deliberate: Visual context-aware speech recognition using pre-trained text-video representations

Paper Authors

Shahram Ghorbani, Yashesh Gaur, Yu Shi, Jinyu Li

Paper Abstract

In this study, we address the problem of leveraging visual signals to improve Automatic Speech Recognition (ASR), also known as visual context-aware ASR (VC-ASR). We explore novel VC-ASR approaches that leverage video and text representations extracted by a self-supervised, pre-trained text-video embedding model. First, we propose a multi-stream attention architecture to leverage signals from both the audio and video modalities. This architecture consists of separate encoders for the two modalities and a single decoder that attends over both. We show that this architecture outperforms fusing the modalities at the signal level. Additionally, we explore leveraging the visual information in a second-pass model, also referred to as a "deliberation model". The deliberation model accepts audio representations and text hypotheses from the first-pass ASR and combines them with a visual stream for improved visual context-aware recognition. The proposed deliberation scheme can work on top of any well-trained ASR and also enables us to leverage the pre-trained text model to ground the hypotheses in the visual features. Our experiments on the HOW2 dataset show that the multi-stream and deliberation architectures are very effective at the VC-ASR task. We evaluate the proposed models in two scenarios: a clean audio stream and distorted audio in which we mask out some specific words. The deliberation model outperforms the multi-stream model and achieves a relative WER improvement of 6% and 8.7% on the clean and masked data, respectively, compared to an audio-only model. The deliberation model also improves recovery of the masked words by 59% relative.
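The abstract outlines the multi-stream attention design: one encoder per modality and a single decoder that attends over both encoder outputs before predicting each token. Below is a minimal, hypothetical PyTorch sketch of that idea, for illustration only; the module choices, dimensions, and names (e.g. MultiStreamAttentionASR, d_model=256) are assumptions and not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of multi-stream attention:
# separate audio and video encoders, one decoder attending over both streams.
import torch
import torch.nn as nn


class MultiStreamAttentionASR(nn.Module):
    def __init__(self, vocab_size, audio_dim=80, video_dim=512, d_model=256):
        super().__init__()
        # Independent encoders for each modality (hypothetical configurations).
        self.audio_encoder = nn.LSTM(audio_dim, d_model, num_layers=2,
                                     batch_first=True, bidirectional=True)
        self.video_encoder = nn.LSTM(video_dim, d_model, num_layers=1,
                                     batch_first=True, bidirectional=True)
        self.audio_proj = nn.Linear(2 * d_model, d_model)
        self.video_proj = nn.Linear(2 * d_model, d_model)
        # One cross-attention block per stream; the decoder fuses both contexts.
        self.audio_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.video_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder_rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.output = nn.Linear(3 * d_model, vocab_size)

    def forward(self, audio_feats, video_feats, prev_tokens):
        # Encode each stream separately.
        audio_enc, _ = self.audio_encoder(audio_feats)
        video_enc, _ = self.video_encoder(video_feats)
        audio_enc = self.audio_proj(audio_enc)
        video_enc = self.video_proj(video_enc)

        # Decoder states from previously emitted tokens (teacher forcing).
        dec_states, _ = self.decoder_rnn(self.embed(prev_tokens))

        # The single decoder attends over both modalities and concatenates
        # the resulting context vectors before predicting the next token.
        audio_ctx, _ = self.audio_attn(dec_states, audio_enc, audio_enc)
        video_ctx, _ = self.video_attn(dec_states, video_enc, video_enc)
        fused = torch.cat([dec_states, audio_ctx, video_ctx], dim=-1)
        return self.output(fused)  # (batch, target_len, vocab_size)
```

Under the same reading, the deliberation (second-pass) model would add further streams to such a decoder: the first-pass text hypotheses, embedded with the pre-trained text-video model, attended over alongside the first-pass audio representations and the visual features.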
