Paper Title

Streaming Multi-speaker ASR with RNN-T

Paper Authors

Ilya Sklyar, Anna Piunova, Yulan Liu

Paper Abstract

Recent research shows end-to-end ASR systems can recognize overlapped speech from multiple speakers. However, all published works have assumed no latency constraints during inference, which does not hold for most voice assistant interactions. This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T) that has been shown to provide high recognition accuracy at a low latency online recognition regime. We investigate two approaches to multi-speaker model training of the RNN-T: deterministic output-target assignment and permutation invariant training. We show that guiding separation with speaker order labels in the former case enhances the high-level speaker tracking capability of RNN-T. Apart from that, with multistyle training on single- and multi-speaker utterances, the resulting models gain robustness against ambiguous numbers of speakers during inference. Our best model achieves a WER of 10.2% on simulated 2-speaker LibriSpeech data, which is competitive with the previously reported state-of-the-art nonstreaming model (10.3%), while the proposed model could be directly applied for streaming applications.
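
For context on the two training strategies named in the abstract, below is a minimal PyTorch-style sketch (not taken from the paper). The names `transducer_loss`, `branch_logits`, and the target arguments are hypothetical placeholders standing in for an RNN-T loss function and the per-speaker output branches and reference transcripts.

```python
from itertools import permutations

import torch


def deterministic_assignment_loss(branch_logits, ordered_targets, transducer_loss):
    # Deterministic output-target assignment: the i-th output branch is always
    # trained against the i-th target (e.g. fixed by a speaker-order label).
    return sum(transducer_loss(logits, targets)
               for logits, targets in zip(branch_logits, ordered_targets))


def pit_loss(branch_logits, targets, transducer_loss):
    # Permutation invariant training (PIT): score every assignment of target
    # sequences to output branches and back-propagate only the cheapest one.
    candidates = []
    for perm in permutations(range(len(targets))):
        candidates.append(sum(transducer_loss(branch_logits[i], targets[j])
                              for i, j in enumerate(perm)))
    return torch.stack(candidates).min()
```

The trade-off illustrated here: deterministic assignment fixes the speaker order up front, while PIT searches over all S! assignments for S speakers at every training step to find the lowest-cost one.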
