3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

Paper Title

3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

Authors

Zhao You, Shulin Feng, Dan Su, Dong Yu

Abstract

Recently, the Conformer-based CTC/AED model has become a mainstream architecture for ASR. In this paper, building on our prior work, we identify and integrate several approaches to achieve further improvements on ASR tasks, which we denote as multi-loss, multi-path and multi-level, summarized as the "3M" model. Specifically, multi-loss refers to the joint CTC/AED loss, and multi-path denotes the Mixture-of-Experts (MoE) architecture, which can effectively increase model capacity without significantly increasing computation cost. Multi-level means that we introduce auxiliary losses at multiple levels of a deep model to help training. We evaluate the proposed method on the public WenetSpeech dataset, and experimental results show that it provides a 12.2%-17.6% relative CER improvement over the baseline model trained with the WeNet toolkit. On our large-scale 150k-hour corpus, the 3M model also shows a clear advantage over the baseline Conformer model. Code is publicly available at https://github.com/tencent-ailab/3m-asr.
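
To illustrate how the "multi-loss" and "multi-level" ideas in the abstract might fit together, here is a minimal Python sketch. It is not the authors' implementation. The function names, the interpolation weight `ctc_weight`, the auxiliary weight `aux_weight`, their default values, and the choice to average the intermediate-layer losses are all assumptions made for illustration; the abstract does not say how the losses are weighted or combined.

```python
from typing import Sequence


def joint_ctc_aed_loss(ctc_loss: float, aed_loss: float,
                       ctc_weight: float = 0.3) -> float:
    """Multi-loss: interpolate the CTC loss and the attention-decoder (AED) loss.

    ctc_weight is a hypothetical hyperparameter in [0, 1].
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * aed_loss


def combined_loss(final_ctc: float, final_aed: float,
                  intermediate_losses: Sequence[float],
                  ctc_weight: float = 0.3,
                  aux_weight: float = 0.1) -> float:
    """Multi-level: add auxiliary losses computed at intermediate layers.

    Averaging the auxiliary losses and scaling them by aux_weight is an
    assumption of this sketch, not something stated in the abstract.
    """
    main = joint_ctc_aed_loss(final_ctc, final_aed, ctc_weight)
    if not intermediate_losses:
        return main
    aux = sum(intermediate_losses) / len(intermediate_losses)
    return main + aux_weight * aux
```

For example, `combined_loss(2.0, 1.0, [1.0, 3.0], ctc_weight=0.5, aux_weight=0.1)` interpolates the two final-layer losses and then adds one tenth of the mean auxiliary loss. In a real training loop the inputs would be differentiable tensors rather than floats.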
