Paper Title

Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning

Paper Authors

Sanna Wager, Aparna Khare, Minhua Wu, Kenichi Kumatani, Shiva Sundaram

Paper Abstract

In this work, we investigated the teacher-student training paradigm to train a fully learnable multi-channel acoustic model for far-field automatic speech recognition (ASR). Using a large offline teacher model trained on beamformed audio, we trained a simpler multi-channel student acoustic model used in the speech recognition system. For the student, both the multi-channel feature extraction layers and the higher classification layers were jointly trained using the logits from the teacher model. In our experiments, compared to a baseline model trained on about 600 hours of transcribed data, a relative word error rate (WER) reduction of about 27.3% was achieved by using an additional 1800 hours of untranscribed data. We also investigated the benefit of pre-training the multi-channel front end to output the beamformed log-mel filter bank energies (LFBE) using an L2 loss. We found that pre-training improved the WER by 10.7% compared to a multi-channel model whose front end was directly initialized with beamformer and mel filter bank coefficients. Finally, combining pre-training and teacher-student training produced a WER reduction of 31% compared to our baseline.
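The abstract describes two training objectives: an L2 pre-training loss that teaches the learnable multi-channel front end to reproduce beamformed LFBE features, and a teacher-student distillation loss that trains the student on the teacher's logits. Below is a minimal PyTorch sketch of both losses; the function names, tensor shapes, and the choice of soft-target cross-entropy for distillation are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the two objectives above (PyTorch).
# All names and shapes are illustrative assumptions, not the paper's code:
# `student_frontend` stands in for the learnable multi-channel front end.
import torch
import torch.nn.functional as F

def frontend_pretraining_loss(student_frontend: torch.nn.Module,
                              multichannel_audio: torch.Tensor,
                              target_lfbe: torch.Tensor) -> torch.Tensor:
    """L2 loss pushing the learnable front end to reproduce the
    beamformed log-mel filter bank energies (LFBE)."""
    predicted_lfbe = student_frontend(multichannel_audio)
    return F.mse_loss(predicted_lfbe, target_lfbe)

def teacher_student_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the student's output distribution and soft
    targets derived from the (frozen) teacher's logits, a standard
    distillation objective."""
    soft_targets = F.softmax(teacher_logits, dim=-1)
    log_probs = F.log_softmax(student_logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```

In this setup the distillation loss needs no transcriptions, which is what lets the additional 1800 hours of untranscribed data contribute to training.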
