Gated Recurrent Fusion with Joint Training Framework

Paper title

Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition

Authors

Cunhang Fan, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Bin Liu, Zhengqi Wen

Abstract

Joint training frameworks for speech enhancement and recognition have obtained quite good performance for robust end-to-end automatic speech recognition (ASR). However, these methods use only the enhanced features as the input to the speech recognition component, so they are affected by the speech distortion problem. To address this problem, this paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR. The GRF algorithm dynamically combines the noisy and enhanced features. The GRF can therefore not only remove the noise signals from the enhanced features, but also learn the raw fine structures from the noisy features, which alleviates the speech distortion. The proposed method consists of speech enhancement, GRF and speech recognition. First, a mask-based speech enhancement network is applied to enhance the input speech. Second, the GRF is applied to address the speech distortion problem. Third, to improve ASR performance, the state-of-the-art speech transformer algorithm is used as the speech recognition component. Finally, the joint training framework is used to optimize these three components simultaneously. Our experiments are conducted on an open-source Mandarin speech corpus called AISHELL-1. Experimental results show that the proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method that uses only the enhanced features. At a low signal-to-noise ratio (0 dB) in particular, the proposed method performs better still, with a 12.67% CER reduction, which suggests the potential of our proposed method.
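
The enhancement and fusion steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' GRF: the abstract gives no equations, so the sketch drops the recurrence and stands in a single feed-forward sigmoid gate that blends the noisy and enhanced features element by element. The function names (`mask_enhance`, `gated_fusion`), the weight shapes, and the random, untrained weights are all hypothetical and chosen only for illustration; in the paper these components are trained jointly with the recognizer.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used for both the mask and the gate."""
    return 1.0 / (1.0 + np.exp(-x))


def mask_enhance(noisy, w_mask, b_mask):
    """Mask-based enhancement: predict a mask in (0, 1) and scale the noisy features.

    Assumption: one linear layer stands in for the enhancement network.
    """
    mask = sigmoid(noisy @ w_mask + b_mask)
    return mask * noisy


def gated_fusion(noisy, enhanced, w_gate, b_gate):
    """Blend noisy and enhanced features with a per-element gate.

    Assumption: a feed-forward gate replaces the recurrent gate of the paper.
    A gate near 1 trusts the enhanced feature; a gate near 0 falls back to the
    noisy feature, which still carries the undistorted fine structure.
    """
    both = np.concatenate([noisy, enhanced], axis=-1)
    gate = sigmoid(both @ w_gate + b_gate)
    return gate * enhanced + (1.0 - gate) * noisy


# Hypothetical toy input: 5 frames of 4-dimensional non-negative features.
rng = np.random.default_rng(0)
frames, dim = 5, 4
noisy = np.abs(rng.normal(size=(frames, dim)))

# Random weights for illustration only; a real system would learn them.
w_mask = rng.normal(scale=0.1, size=(dim, dim))
b_mask = np.zeros(dim)
w_gate = rng.normal(scale=0.1, size=(2 * dim, dim))
b_gate = np.zeros(dim)

enhanced = mask_enhance(noisy, w_mask, b_mask)
fused = gated_fusion(noisy, enhanced, w_gate, b_gate)  # would feed the recognizer
```

Because the gate is a convex weight, each fused value lies between the enhanced and noisy values for that element, so the recognizer never sees a feature more attenuated than the enhanced one or noisier than the raw input.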
