Paper title
On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer
Paper authors
Paper abstract
The Hybrid Autoregressive Transducer (HAT) is a recently proposed end-to-end acoustic model that extends the standard Recurrent Neural Network Transducer (RNN-T) to support external language model (LM) fusion. In HAT, the blank probability and the label probability are estimated using two separate probability distributions, which gives a more accurate estimate of the internal LM score and therefore works better when combined with an external LM. Previous work mainly focuses on training HAT models with the negative log-likelihood loss. In this paper, we study minimum word error rate (MWER) training of HAT, a criterion that is closer to the evaluation metric for speech recognition and has been successfully applied to other types of end-to-end models such as sequence-to-sequence (S2S) and RNN-T models. From experiments with around 30,000 hours of training data, we show that MWER training can improve the accuracy of HAT models while also improving the robustness of the model to decoding hyper-parameters such as length normalization and decoding beam during inference.
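The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`hat_distribution`, `word_errors`, `mwer_loss`) are hypothetical. Two details are assumptions taken from the way HAT and MWER are commonly described, not from the abstract itself: the blank probability is modeled with a sigmoid, and the MWER loss is an expected word-error count over an N-best list with the mean error subtracted as a baseline.

```python
import math


def hat_distribution(blank_logit, label_logits):
    """Split one output step into a blank probability and label probabilities.

    Assumption: blank uses a sigmoid (Bernoulli) and the labels use a softmax
    scaled by (1 - blank), so that all outcomes sum to one.
    """
    blank = 1.0 / (1.0 + math.exp(-blank_logit))
    peak = max(label_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in label_logits]
    total = sum(exps)
    labels = [(1.0 - blank) * e / total for e in exps]
    return blank, labels


def word_errors(hyp, ref):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(prev[j] + 1,               # extra word in hyp
                           cur[j - 1] + 1,            # missing word in hyp
                           prev[j - 1] + (h != r)))   # match or substitution
        prev = cur
    return prev[-1]


def mwer_loss(nbest, log_probs, reference):
    """Expected word errors over an N-best list, relative to the mean error.

    Assumption: hypothesis probabilities are renormalized over the N-best
    list, and the mean error is subtracted as a variance-reduction baseline.
    """
    peak = max(log_probs)
    weights = [math.exp(lp - peak) for lp in log_probs]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    errors = [word_errors(h.split(), reference.split()) for h in nbest]
    mean_error = sum(errors) / len(errors)
    return sum(p * (e - mean_error) for p, e in zip(posteriors, errors))
```

In this sketch the loss decreases as probability mass moves toward hypotheses with fewer word errors than the N-best average, which is the sense in which the criterion tracks the evaluation metric more closely than negative log-likelihood does.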