Paper Title
VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition
Paper Authors
Paper Abstract
This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on two recent technologies that were developed independently: array-geometry-agnostic continuous speech separation, or VarArray, and streaming multi-talker ASR based on token-level serialized output training (t-SOT). To combine the best of both technologies, we design a new t-SOT-based ASR model that generates a serialized multi-talker transcription from the two separated speech signals produced by VarArray. We also propose a pre-training scheme for this ASR model in which we simulate VarArray's output signals from monaural single-talker ASR training data. Conversation transcription experiments using the AMI meeting corpus show that a system based on the proposed framework significantly outperforms conventional ones. Our system achieves state-of-the-art word error rates of 13.7% and 15.5% on the AMI development and evaluation sets, respectively, in the multiple-distant-microphone setting while retaining streaming inference capability.
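The serialized multi-talker transcription mentioned in the abstract can be illustrated with a minimal sketch of t-SOT-style label serialization: tokens from overlapping utterances on different virtual output channels are merged in chronological order of their end times, with a special channel-change token inserted at each speaker switch. The function name, token strings, and timestamps below are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of t-SOT-style label serialization, assuming two
# "virtual channels" with known token end times. Names and data are
# illustrative only.

def serialize_t_sot(streams, cc_token="<cc>"):
    """Merge per-channel lists of (token, end_time) pairs into a single
    t-SOT label sequence, inserting cc_token on every channel change."""
    # Flatten to (end_time, channel, token) and sort chronologically.
    events = sorted(
        (t, ch, tok)
        for ch, stream in enumerate(streams)
        for tok, t in stream
    )
    labels, prev_ch = [], None
    for _, ch, tok in events:
        if prev_ch is not None and ch != prev_ch:
            labels.append(cc_token)  # marks a speaker/channel switch
        labels.append(tok)
        prev_ch = ch
    return labels

# Two overlapping utterances on separate virtual channels:
ch0 = [("hello", 0.5), ("world", 1.0)]
ch1 = [("good", 0.7), ("morning", 1.2)]
print(serialize_t_sot([ch0, ch1]))
# -> ['hello', '<cc>', 'good', '<cc>', 'world', '<cc>', 'morning']
```

A single decoder trained on such interleaved sequences can transcribe overlapping speakers with one output stream, which is what makes streaming inference possible in this setting.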