Paper Title

Multiple-hypothesis RNN-T Loss for Unsupervised Fine-tuning and Self-training of Neural Transducer

Authors

Cong-Thanh Do, Mohan Li, Rama Doddipatla

Abstract

This paper proposes a new approach to perform unsupervised fine-tuning and self-training using unlabeled speech data for recurrent neural network (RNN)-Transducer (RNN-T) end-to-end (E2E) automatic speech recognition (ASR) systems. Conventional systems perform fine-tuning/self-training using ASR hypotheses as the targets when using unlabeled audio data, and are thus susceptible to the ASR performance of the base model. Here, in order to alleviate the influence of ASR errors while using unlabeled data, we propose a multiple-hypothesis RNN-T loss that incorporates multiple ASR 1-best hypotheses into the loss function. For the fine-tuning task, ASR experiments on LibriSpeech show that the multiple-hypothesis approach achieves a relative word error rate (WER) reduction of 14.2% on the test_other set when compared to the single-hypothesis approach. For the self-training task, ASR models are trained using supervised data from Wall Street Journal (WSJ) and Aurora-4, along with CHiME-4 real noisy data as unlabeled data. The multiple-hypothesis approach yields a relative WER reduction of 3.3% on the CHiME-4 single-channel real noisy evaluation set when compared with the single-hypothesis approach.
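The abstract describes combining several ASR 1-best hypotheses in the transducer loss so that no single erroneous pseudo-label dominates training. A minimal sketch of one plausible combination scheme is below: it assumes the per-hypothesis RNN-T losses have already been computed (e.g., by a standard transducer loss routine) and weights them by softmax-normalized decoder scores. The function name, the score-based weighting, and the inputs are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def multi_hypothesis_rnnt_loss(per_hyp_losses, hyp_scores):
    """Illustrative combination of per-hypothesis RNN-T losses.

    per_hyp_losses: RNN-T loss of the audio against each 1-best hypothesis
                    (assumed precomputed by a standard transducer loss).
    hyp_scores:     decoder log-scores of each hypothesis, softmax-normalized
                    into weights (an assumed weighting scheme).
    """
    scores = np.asarray(hyp_scores, dtype=np.float64)
    # Numerically stable softmax over hypothesis scores.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum: confident hypotheses contribute more to the total loss.
    return float(np.dot(weights, np.asarray(per_hyp_losses, dtype=np.float64)))

# Two pseudo-label hypotheses with equal scores contribute equally:
combined = multi_hypothesis_rnnt_loss([2.0, 4.0], [0.0, 0.0])  # 3.0
```

With equal scores the result is the plain average of the hypothesis losses; as one hypothesis's score grows, the combined loss approaches that hypothesis's loss alone, which is how such a scheme down-weights likely-erroneous pseudo-labels.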
