Paper Title
Alignment Restricted Streaming Recurrent Neural Network Transducer
Paper Authors
Paper Abstract
There is growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal alignment between the training transcripts and the audio. As a result, RNN-T models built with uni-directional long short-term memory (LSTM) encoders tend to wait for long spans of input audio before streaming out already-decoded ASR tokens. In this work, we propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T (Ar-RNN-T) models, which utilize audio-text alignment information to guide the loss computation. We compare the proposed method with existing works, such as monotonic RNN-T, on LibriSpeech and in-house datasets. We show that the Ar-RNN-T loss provides refined control for navigating the trade-off between token emission delays and the Word Error Rate (WER). Ar-RNN-T models also improve downstream applications such as ASR end-pointing by guaranteeing token emissions within any given latency range. Moreover, the Ar-RNN-T loss allows for bigger batch sizes and 4x higher throughput for our LSTM model architecture, enabling faster training and convergence on GPUs.
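To make the alignment restriction concrete, below is a minimal sketch (not the authors' released implementation) of an RNN-T forward pass in which the emit transitions of the loss lattice are masked outside a per-token alignment band. The names align, b_left, and b_right are illustrative assumptions: align[u] is the reference emission frame for token u, obtained from an external audio-text alignment, and the two buffers widen the band of frames in which emitting token u is still permitted.

```python
import numpy as np

NEG_INF = -1.0e30  # stand-in for log(0)

def ar_rnnt_forward(log_probs, labels, align, b_left, b_right, blank=0):
    """Alignment-restricted RNN-T negative log-likelihood (forward pass only).

    log_probs : (T, U+1, V) log-softmax outputs of the joint network.
    labels    : (U,) target token ids.
    align     : (U,) reference emission frame per token, taken from an
                external audio-text alignment (an assumption of this sketch).
    b_left, b_right : buffers widening the allowed emission band per token.
    """
    T, U1, _ = log_probs.shape
    U = U1 - 1
    alpha = np.full((T, U1), NEG_INF)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t == 0 and u == 0:
                continue
            score = NEG_INF
            # Blank transition: consume one audio frame, keep the label index.
            if t > 0:
                score = np.logaddexp(score,
                                     alpha[t - 1, u] + log_probs[t - 1, u, blank])
            # Emit transition: output token u at frame t -- allowed only inside
            # the alignment band [align[u-1] - b_left, align[u-1] + b_right].
            if u > 0 and align[u - 1] - b_left <= t <= align[u - 1] + b_right:
                score = np.logaddexp(score,
                                     alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
            alpha[t, u] = score
    # Terminate with a final blank emission from the last lattice node.
    return -(alpha[T - 1, U] + log_probs[T - 1, U, blank])
```

In this reading, masking the emit transitions both bounds how late each token can be emitted (the latency guarantee mentioned above) and zeroes out most of the (t, u) lattice, which is consistent with the reported gains in batch size and training throughput.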