Paper Title

Efficient minimum word error rate training of RNN-Transducer for end-to-end speech recognition

Paper Authors

Jinxi Guo, Gautam Tiwari, Jasha Droppo, Maarten Van Segbroeck, Che-Wei Huang, Andreas Stolcke, Roland Maas

Paper Abstract


In this work, we propose a novel and efficient minimum word error rate (MWER) training method for the RNN-Transducer (RNN-T). Unlike previous work on this topic, which performs on-the-fly limited-size beam-search decoding and generates alignment scores for expected edit-distance computation, our proposed method re-calculates and sums the scores of all possible alignments for each hypothesis in an N-best list. The hypothesis probability scores and back-propagated gradients are calculated efficiently using the forward-backward algorithm. Moreover, the proposed method allows us to decouple the decoding and training processes, so we can perform offline parallel decoding and MWER training on each subset iteratively. Experimental results show that the proposed semi-on-the-fly method is 6 times faster than the on-the-fly method and yields a similar WER improvement (3.6%) over a baseline RNN-T model. The proposed MWER training also effectively reduces the high deletion errors (9.2% WER reduction) introduced by RNN-T models when an end-of-sentence (EOS) token is added for endpointing. Further improvement can be achieved by using the proposed RNN-T rescoring method to re-rank hypotheses and an external RNN-LM for additional rescoring. The best system achieves a 5% relative improvement on an English test set of real far-field recordings and an 11.6% WER reduction on music-domain utterances.
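The two computations named in the abstract are (i) summing the probabilities of all RNN-T alignments of a hypothesis with the forward algorithm, and (ii) the MWER objective, an expected number of word errors over a renormalized N-best list. The NumPy sketch below illustrates both under stated assumptions: the per-lattice-node blank/label log-probabilities from the joint network are taken as given, and the function names (rnnt_log_prob, mwer_loss) as well as the mean-error baseline subtraction are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.special import logsumexp

def rnnt_log_prob(log_probs_blank, log_probs_label):
    """Sum over all alignments of one hypothesis via the RNN-T forward algorithm.

    log_probs_blank: (T, U+1) array, log P(blank | t, u)
    log_probs_label: (T, U)   array, log P(y_{u+1} | t, u)
    Returns log P(y | x), the total probability of the hypothesis.
    """
    T, U1 = log_probs_blank.shape
    U = U1 - 1
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0  # start at lattice node (t=0, u=0)
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            terms = []
            if t > 0:  # arrive by emitting blank at node (t-1, u)
                terms.append(alpha[t - 1, u] + log_probs_blank[t - 1, u])
            if u > 0:  # arrive by emitting label y_u at node (t, u-1)
                terms.append(alpha[t, u - 1] + log_probs_label[t, u - 1])
            alpha[t, u] = logsumexp(terms)
    # terminate by emitting blank from the final node (T-1, U)
    return alpha[T - 1, U] + log_probs_blank[T - 1, U]

def mwer_loss(nbest_log_probs, nbest_word_errors):
    """Expected number of word errors over a renormalized N-best list."""
    log_probs = np.asarray(nbest_log_probs, dtype=float)
    errors = np.asarray(nbest_word_errors, dtype=float)
    posterior = np.exp(log_probs - logsumexp(log_probs))  # softmax over N-best
    # Subtracting the mean error is a common variance-reduction baseline
    # (an assumption here, used in many MWER recipes).
    return float(np.sum(posterior * (errors - errors.mean())))

if __name__ == "__main__":
    # Toy usage with random placeholder scores: T=4 frames, U=2 labels.
    rng = np.random.default_rng(0)
    T, U = 4, 2
    lp_blank = np.log(rng.uniform(0.1, 1.0, size=(T, U + 1)))
    lp_label = np.log(rng.uniform(0.1, 1.0, size=(T, U)))
    print("log P(y|x) =", rnnt_log_prob(lp_blank, lp_label))
    print("MWER loss  =", mwer_loss([-1.2, -2.3, -3.0], [0, 2, 3]))
```

Because the hypothesis score is a full sum over alignments rather than a single beam-search path score, the same routine can be run offline over precomputed N-best lists, which is what enables the semi-on-the-fly decoupling of decoding and training described above.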
