Paper Title
On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition
Authors
Abstract
Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: the recurrent neural network transducer (RNN-T), the RNN attention-based encoder-decoder (RNN-AED), and the Transformer attention-based encoder-decoder (Transformer-AED). In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models in both non-streaming and streaming modes. We train these models on 65 thousand hours of Microsoft anonymized training data. Because E2E models are more data-hungry, it is better to compare their effectiveness with a large amount of training data; to the best of our knowledge, no such comprehensive study has been conducted yet. We show that although the AED models are stronger than RNN-T in non-streaming mode, RNN-T is very competitive in streaming mode if its encoder is properly initialized. Among the three E2E models, Transformer-AED achieves the best accuracy in both streaming and non-streaming modes. We also show that both the streaming RNN-T and streaming Transformer-AED models obtain better accuracy than a highly optimized hybrid model.
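The abstract contrasts RNN-T with the two AED variants. As background, the defining piece of an RNN-T is its joint network, which combines a per-frame acoustic encoder output with a per-label prediction-network output to score every vocabulary entry (including blank) at each (frame, label-history) grid point. The following is a minimal, dependency-free sketch of that additive combination; it is an illustration of the general RNN-T formulation, not the paper's implementation, and all names and dimensions here are invented for the example.

```python
# Illustrative sketch of an RNN-T joint network (not the paper's code):
# combine one encoder frame vector f_t with one prediction-network vector g_u,
# then project to vocabulary logits (blank + labels).

def joint(f_t, g_u, weights, bias):
    """Score vocabulary entries for a single (t, u) grid point.

    f_t, g_u : lists of floats (encoder / predictor vectors of equal length)
    weights  : one weight row per vocabulary entry
    bias     : one bias per vocabulary entry
    Returns one logit per vocabulary entry.
    """
    # Simple additive combination of the two streams (a common choice).
    h = [a + b for a, b in zip(f_t, g_u)]
    # Linear projection to the output vocabulary.
    return [sum(w_i * h_i for w_i, h_i in zip(row, h)) + b
            for row, b in zip(weights, bias)]

# Toy usage: 2-dim hidden vectors, 3-entry vocabulary (blank + 2 labels).
f_t = [1.0, 0.0]
g_u = [0.0, 1.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, -1.0]
print(joint(f_t, g_u, W, b))  # -> [1.0, 1.0, 1.0]
```

In a real transducer these logits are computed over the full frame-by-label grid and trained with the RNN-T loss; this streaming-friendly factorization (no attention over future frames) is why the abstract treats RNN-T as a natural streaming candidate.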