Paper Title
A Density Ratio Approach to Language Model Fusion in End-To-End Automatic Speech Recognition
Paper Authors
Paper Abstract
This article describes a density ratio approach to integrating external Language Models (LMs) into end-to-end models for Automatic Speech Recognition (ASR). Applied to a Recurrent Neural Network Transducer (RNN-T) ASR model trained on a given domain, a matched in-domain RNN-LM, and a target domain RNN-LM, the proposed method uses Bayes' Rule to define RNN-T posteriors for the target domain, in a manner directly analogous to the classic hybrid model for ASR based on Deep Neural Networks (DNNs) or LSTMs in the Hidden Markov Model (HMM) framework (Bourlard & Morgan, 1994). The proposed approach is evaluated in cross-domain and limited-data scenarios, for which a significant amount of target domain text data is used for LM training, but only limited (or no) {audio, transcript} training data pairs are used to train the RNN-T. Specifically, an RNN-T model trained on paired audio & transcript data from YouTube is evaluated for its ability to generalize to Voice Search data. The Density Ratio method was found to consistently outperform the dominant approach to LM and end-to-end ASR integration, Shallow Fusion.
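As a reading aid, here is a minimal sketch of the Bayes' Rule combination the abstract refers to; the notation (p_src, p_tau, and the scaling weights lambda) is illustrative and not taken from the paper. Assuming the acoustic likelihood is shared across domains, the source-domain RNN-T posterior p_src(y|x), the source-domain LM p_src(y), and the target-domain LM p_tau(y) combine into a target-domain score, up to a term independent of the hypothesis y:

$$
\log p_{\tau}(y \mid x) \;\approx\; \log p_{\mathrm{src}}(y \mid x) \;-\; \lambda_{\mathrm{src}} \log p_{\mathrm{src}}(y) \;+\; \lambda_{\tau} \log p_{\tau}(y)
$$

With \lambda_{\mathrm{src}} = \lambda_{\tau} = 1 this is Bayes' Rule applied directly; in practice such fusion weights are typically tuned on held-out data, and dropping the source-LM subtraction recovers standard Shallow Fusion with the target-domain LM.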