Paper Title
Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition
Paper Authors
Paper Abstract
External language model (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR), which has no clear division between acoustic and language models. In this work, we propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models, with no additional model training, including the most popular recurrent neural network transducer (RNN-T) and attention-based encoder-decoder (AED) models. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the training data in the source domain. With ILME, the internal LM scores of an E2E model are estimated and subtracted from the log-linear interpolation between the scores of the E2E model and the external LM. The internal LM scores are approximated as the output of the E2E model when its acoustic components are eliminated. ILME can alleviate the domain mismatch between training and testing, or improve multi-domain E2E ASR. In experiments with RNN-T and AED models trained on 30K hours of data, ILME achieves up to 15.5% and 6.8% relative word error rate reductions over Shallow Fusion on out-of-domain LibriSpeech and in-domain Microsoft production test sets, respectively.
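
As a minimal sketch of the inference rule the abstract describes (the interpolation weights \lambda and \mu are assumed names for the external-LM and internal-LM weights; they are not given in the abstract), the ILME-based decoding can be written as:

\hat{Y} = \operatorname*{arg\,max}_{Y} \big[ \log P(Y \mid X;\, \theta_{\mathrm{E2E}}) + \lambda \log P(Y;\, \theta_{\mathrm{ExtLM}}) - \mu \log P(Y;\, \theta_{\mathrm{ILM}}) \big]

Here, per the abstract, the internal LM term \log P(Y; \theta_{\mathrm{ILM}}) is approximated by the output of the E2E model when its acoustic components are eliminated (e.g., with the acoustic context removed from the decoder or prediction network), while \log P(Y; \theta_{\mathrm{ExtLM}}) is the external LM score; setting \mu = 0 recovers standard Shallow Fusion.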