Paper Title

Deliberation Model Based Two-Pass End-to-End Speech Recognition

Authors

Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prabhavalkar

Abstract

End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve the quality, a two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model while maintaining a reasonable latency. The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses. In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves 12% relative WER reduction compared to LAS rescoring in Google Voice Search (VS) tasks, and 23% reduction on a proper noun test set. Compared to a large conventional model, our best model performs 21% relatively better for VS. In terms of computational complexity, the deliberation decoder has a larger size than the LAS decoder, and hence requires more computations in second-pass decoding.
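
The idea of attending to both acoustics and first-pass hypotheses can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the attention type or how the two contexts are combined, so plain dot-product attention and simple concatenation are assumptions here, and the names `attend` and `deliberation_context` are hypothetical. The inputs stand in for encoder outputs that a real system would produce with trained networks.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Dot-product attention (an assumption, not taken from the abstract).

    Returns the average of the rows in `keys`, weighted by the softmax of
    each row's dot product with `query`.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    return [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]


def deliberation_context(decoder_state, acoustic_enc, hypothesis_enc):
    """Attend to both sources and concatenate the two context vectors.

    acoustic_enc:   one vector per audio frame (stand-in for encoder output)
    hypothesis_enc: one vector per first-pass token (stand-in for the
                    bidirectional encoder output mentioned in the abstract)
    Concatenation is an illustrative choice for combining the contexts.
    """
    acoustic_ctx = attend(decoder_state, acoustic_enc)
    hypothesis_ctx = attend(decoder_state, hypothesis_enc)
    return acoustic_ctx + hypothesis_ctx  # list concatenation


# Toy inputs: 3 audio frames and 2 hypothesis tokens, all of dimension 2.
state = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
tokens = [[0.2, 0.8], [0.9, 0.1]]
context = deliberation_context(state, frames, tokens)
```

In this sketch the second-pass decoder would consume `context` at every output step, which is why a decoder of this kind does more work per step than one that attends to acoustics alone.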
