Deliberation Model Based Two-Pass End-to-End Speech Recognition

Paper Title

Deliberation Model Based Two-Pass End-to-End Speech Recognition

Authors

Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prabhavalkar

Abstract

End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve the quality, a two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model while maintaining a reasonable latency. The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses. In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves 12% relative WER reduction compared to LAS rescoring in Google Voice Search (VS) tasks, and 23% reduction on a proper noun test set. Compared to a large conventional model, our best model performs 21% relatively better for VS. In terms of computational complexity, the deliberation decoder has a larger size than the LAS decoder, and hence requires more computations in second-pass decoding.
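The abstract reports its gains as relative reductions in word error rate (WER), that is, the drop in WER divided by the baseline WER. The short sketch below shows only that arithmetic. The function name `relative_reduction` and the two WER values are hypothetical, chosen for illustration, and are not figures from the paper.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs in percent, used only to illustrate the arithmetic.
# These are NOT results reported in the paper.
baseline = 10.0
improved = 8.8
print(f"relative WER reduction: {relative_reduction(baseline, improved):.1%}")
```

A relative reduction is not an absolute one: with these illustrative values the absolute drop is 1.2 percentage points, while the relative drop is about 12% of the baseline.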
