Paper Title

Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus

Authors

Cal Peyser, Sepand Mavandadi, Tara N. Sainath, James Apfel, Ruoming Pang, Shankar Kumar

Abstract

End-to-end (E2E) automatic speech recognition (ASR) systems lack the distinct language model (LM) component that characterizes traditional speech systems. While this simplifies the model architecture, it complicates the task of incorporating text-only data into training, which is important to the recognition of tail words that do not occur often in audio-text pairs. While shallow fusion has been proposed as a method for incorporating a pre-trained LM into an E2E model at inference time, it has not yet been explored for very large text corpora, and it has been shown to be very sensitive to hyperparameter settings in the beam search. In this work, we apply shallow fusion to incorporate a very large text corpus into a state-of-the-art E2E ASR model. We explore the impact of model size and show that intelligent pruning of the training set can be more effective than increasing the parameter count. Additionally, we show that incorporating the LM in minimum word error rate (MWER) fine-tuning makes shallow fusion far less dependent on optimal hyperparameter settings, reducing the difficulty of that tuning problem.
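For context on the first technique the abstract names: shallow fusion combines the E2E model's per-token posterior with an external LM by log-linear interpolation, log P_asr(y|x) + λ·log P_lm(y), at every beam-search step. The sketch below illustrates one such step. It is a minimal illustration rather than the paper's implementation; `asr_step_fn`, `lm_step_fn`, `lm_weight`, and `beam_size` are hypothetical names introduced here.

```python
import numpy as np

def fused_step_scores(asr_log_probs, lm_log_probs, lm_weight=0.3):
    # Shallow fusion: log-linear interpolation of the E2E posterior with
    # the external LM. lm_weight is the beam-search hyperparameter the
    # abstract notes the method is very sensitive to.
    return asr_log_probs + lm_weight * lm_log_probs

def beam_search_step(beams, asr_step_fn, lm_step_fn, lm_weight=0.3, beam_size=8):
    """One expansion step of shallow-fusion beam search (illustrative).

    beams: list of (prefix_token_ids, cumulative_score) pairs.
    asr_step_fn / lm_step_fn: hypothetical callables mapping a prefix to
    a (vocab_size,) array of next-token log-probabilities.
    """
    candidates = []
    for prefix, score in beams:
        fused = fused_step_scores(asr_step_fn(prefix), lm_step_fn(prefix),
                                  lm_weight)
        # Keep the beam_size best extensions of this prefix.
        top = np.argpartition(-fused, beam_size)[:beam_size]
        candidates += [(prefix + [int(t)], score + float(fused[t])) for t in top]
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```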
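The abstract's second contribution, folding the LM into MWER fine-tuning, can be sketched similarly: fused sequence-level scores are renormalized over an N-best list and used to weight each hypothesis's word errors. This is a hedged sketch under assumed inputs; the array names are illustrative, not the paper's code.

```python
import numpy as np

def mwer_loss_with_fusion(asr_scores, lm_scores, word_errors, lm_weight=0.3):
    """Expected word error rate over an N-best list, with the external LM
    folded into the hypothesis scores (illustrative sketch).

    asr_scores, lm_scores: (n,) sequence-level log-probabilities.
    word_errors: (n,) word-error counts of each hypothesis vs. the reference.
    """
    fused = asr_scores + lm_weight * lm_scores
    # Renormalize the fused scores over the N-best list.
    probs = np.exp(fused - np.logaddexp.reduce(fused))
    # Subtracting the mean error count is the usual variance-reducing baseline.
    return float(np.sum(probs * (word_errors - word_errors.mean())))
```

Training the model against a loss of this form, with the LM term inside the fused scores, is what the abstract reports makes decoding far less dependent on the exact `lm_weight` chosen at inference time.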
