Paper title
Modular End-to-end Automatic Speech Recognition Framework for Acoustic-to-word Model
Paper authors
Paper abstract
End-to-end (E2E) systems play an increasingly important role in automatic speech recognition (ASR) and have achieved strong performance. However, E2E systems map input acoustic features directly to output word sequences, so they can only be trained on limited acoustic data. Extra text data is widely used to improve the results of traditional artificial neural network-hidden Markov model (ANN-HMM) hybrid systems, but incorporating extra text data into standard E2E ASR systems may break the E2E property during decoding. In this paper, a novel modular E2E ASR system is proposed. It consists of two parts: an acoustic-to-phoneme (A2P) model and a phoneme-to-word (P2W) model. The A2P model is trained on acoustic data, while extra data, including large-scale text data, can be used to train the P2W model. This additional data enables the modular E2E ASR system to model not only the acoustic part but also the language part. During the decoding phase, the two models are integrated and act as a standard acoustic-to-word (A2W) model. In other words, the proposed modular E2E ASR system can be easily trained with extra text data and decoded in the same way as a standard E2E ASR system. Experimental results on the Switchboard corpus show that the modular E2E model achieves a lower word error rate (WER) than standard A2W models.
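The modular structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify model architectures, so both stages here are toy stand-ins. The phoneme inventory `PHONES`, the lexicon `LEXICON`, and the functions `toy_a2p`, `toy_p2w`, and `make_a2w` are all hypothetical names introduced for illustration. The A2P stage is assumed to use a greedy, CTC-style collapse of per-frame scores, and the P2W stage is replaced by a longest-match lexicon lookup rather than a model trained on text. The point is only that the two stages compose into a single acoustic-to-word function at decoding time.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical phoneme inventory; index 0 is a blank symbol (an assumption).
PHONES: List[str] = ["<blank>", "HH", "AY", "DH", "EH", "R"]

# Hypothetical pronunciation lexicon standing in for a trained P2W model.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("HH", "AY"): "hi",
    ("DH", "EH", "R"): "there",
}

Frame = Sequence[float]  # one vector of per-phoneme scores for a single frame


def toy_a2p(frames: List[Frame]) -> List[str]:
    """Toy A2P: take the best phoneme per frame, merge repeats, drop blanks."""
    best = [PHONES[max(range(len(f)), key=f.__getitem__)] for f in frames]
    phones: List[str] = []
    prev = None
    for p in best:
        if p != prev and p != "<blank>":
            phones.append(p)
        prev = p
    return phones


def toy_p2w(phones: List[str]) -> List[str]:
    """Toy P2W: greedy longest-match lookup of phoneme spans in LEXICON."""
    words: List[str] = []
    max_len = max(len(key) for key in LEXICON)
    i = 0
    while i < len(phones):
        for n in range(min(max_len, len(phones) - i), 0, -1):
            key = tuple(phones[i:i + n])
            if key in LEXICON:
                words.append(LEXICON[key])
                i += n
                break
        else:
            # No lexicon entry starts here: emit an unknown token and move on.
            words.append("<unk>")
            i += 1
    return words


def make_a2w(
    a2p: Callable[[List[Frame]], List[str]],
    p2w: Callable[[List[str]], List[str]],
) -> Callable[[List[Frame]], List[str]]:
    """Compose the two stages so callers see a single acoustic-to-word model."""
    def a2w(frames: List[Frame]) -> List[str]:
        return p2w(a2p(frames))
    return a2w


def score(phone: str) -> List[float]:
    """Build a toy frame whose highest score belongs to the given phoneme."""
    return [1.0 if p == phone else 0.0 for p in PHONES]


frames = [score(p) for p in ["HH", "HH", "<blank>", "AY", "DH", "EH", "EH", "R"]]
a2w = make_a2w(toy_a2p, toy_p2w)
print(a2w(frames))
```

Because `make_a2w` only depends on the two callables, the P2W stage could in principle be swapped for one trained on additional text without touching the acoustic stage, which is the property the abstract emphasizes.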