Paper Title
Deliberation Model Based Two-Pass End-to-End Speech Recognition
Paper Authors
Paper Abstract
End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve quality, a two-pass model has been proposed that rescores streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model while maintaining reasonable latency. That model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses. In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves a 12% relative word error rate (WER) reduction compared to LAS rescoring on Google Voice Search (VS) tasks, and a 23% reduction on a proper noun test set. Compared to a large conventional model, our best model performs 21% relatively better on VS. In terms of computational complexity, the deliberation decoder is larger than the LAS decoder and hence requires more computation in second-pass decoding.
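The idea of attending to both acoustics and first-pass hypotheses can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`attend`, `deliberation_context`), the plain dot-product attention, the reuse of one query for both sources, and the concatenation of the two context vectors are all assumptions chosen for illustration, and the input vectors are made-up values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Dot-product attention: return the attention-weighted average of `keys`."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]


def deliberation_context(query, acoustic_enc, hypothesis_enc):
    """Attend to each source separately, then concatenate the two contexts.

    Concatenation is an assumption made for this sketch; the abstract only
    states that both sources are attended to.
    """
    return attend(query, acoustic_enc) + attend(query, hypothesis_enc)


# Hypothetical toy inputs, 2-dimensional for readability.
acoustic_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 encoded audio frames
hypothesis_enc = [[0.5, 0.5], [1.0, 0.0]]            # 2 encoded first-pass tokens
query = [1.0, 0.0]                                   # current decoder state

context = deliberation_context(query, acoustic_enc, hypothesis_enc)
print(len(context))  # acoustic context (2 dims) + hypothesis context (2 dims)
```

In this sketch the second-pass decoder would consume `context` at each output step. An LAS-style rescorer corresponds to using only the acoustic half, and a text-only correction model to using only the hypothesis half.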