Paper Title
Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model
Paper Authors
Paper Abstract
Videos uploaded to social media are often accompanied by textual descriptions. When building automatic speech recognition (ASR) systems for videos, we can exploit the contextual information provided by such video metadata. In this paper, we explore ASR lattice rescoring by selectively attending to the video descriptions. First, we use an attention-based method to extract contextual vector representations of the video metadata, and use these representations as part of the input to a neural language model during lattice rescoring. Second, we propose a hybrid pointer network approach that explicitly interpolates the probabilities of words that occur in the metadata. We perform experimental evaluations on both language modeling and ASR tasks, and demonstrate that both proposed methods improve performance by selectively leveraging the video metadata.
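The abstract does not spell out the interpolation, so the following is only a minimal sketch of how a pointer-style mixture over metadata words is commonly written, not the authors' implementation. The function name `hybrid_pointer_probs`, the scalar `gate`, and the use of plain softmax attention are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_pointer_probs(vocab_logits, attn_scores, metadata_ids, gate):
    """Mix a vocabulary distribution with a copy distribution over metadata.

    vocab_logits: unnormalized score per vocabulary word (from a neural LM).
    attn_scores:  unnormalized attention score per metadata position.
    metadata_ids: vocabulary id of the word at each metadata position.
    gate:         weight in [0, 1] on the vocabulary distribution
                  (hypothetical; in practice it would likely be learned).
    """
    p_vocab = softmax(vocab_logits)
    if not metadata_ids:
        # No metadata to point at: fall back to the plain LM distribution.
        return p_vocab
    attn = softmax(attn_scores)
    # Pointer distribution: sum attention mass over every position
    # where a given vocabulary word occurs in the metadata.
    p_ptr = [0.0] * len(vocab_logits)
    for pos, word_id in enumerate(metadata_ids):
        p_ptr[word_id] += attn[pos]
    # Interpolate the two distributions; the result still sums to 1.
    return [gate * pv + (1.0 - gate) * pp for pv, pp in zip(p_vocab, p_ptr)]


# Toy example: 4-word vocabulary, metadata contains word 2 twice and word 3 once.
probs = hybrid_pointer_probs(
    vocab_logits=[0.0, 0.0, 0.0, 0.0],
    attn_scores=[0.0, 0.0, 0.0],
    metadata_ids=[2, 2, 3],
    gate=0.5,
)
```

In this toy case the words that occur in the metadata (ids 2 and 3) receive more probability than under the uniform vocabulary distribution, while words absent from the metadata are scaled down by the gate. During lattice rescoring, probabilities of this kind would replace or supplement the language model scores on lattice arcs.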