WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement

Paper Title

WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement

Authors

Hsieh, Tsun-An, Wang, Hsin-Min, Lu, Xugang, Tsao, Yu

Abstract

Due to the simple design pipeline, end-to-end (E2E) neural models for speech enhancement (SE) have attracted great interest. In order to improve the performance of the E2E model, the locality and temporal sequential properties of speech should be efficiently taken into account when modelling. However, in most current E2E models for SE, these properties are either not fully considered or are too complex to be realized. In this paper, we propose an efficient E2E SE model, termed WaveCRN. In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU). Unlike a conventional temporal sequential model that uses a long short-term memory (LSTM) network, which is difficult to parallelize, SRU can be efficiently parallelized in calculation with even fewer model parameters. In addition, in order to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers; this is different from the approach that applies the estimated ratio mask on the noisy spectral features, which is commonly used in speech separation methods. Experimental results on speech denoising and compressed speech restoration tasks confirm that with the lightweight architecture of SRU and the feature-mapping-based RFM, WaveCRN performs comparably with other state-of-the-art approaches with notably reduced model complexity and inference time.
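
The pipeline described in the abstract (convolutional front end, SRU over the resulting frames, mask applied to the hidden feature map) can be sketched roughly as follows. This is a minimal single-channel illustration, not the authors' implementation: the scalar weights, the kernel, the stride, and the function names are all hypothetical. The recurrence follows the commonly cited SRU formulation (forget gate, reset gate, highway connection). Bounding the mask with `tanh` is an assumption made here for illustration; the abstract only states that the mask is restricted and applied to feature maps rather than to spectral features.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conv1d(wave, kernel, stride):
    """Strided 1-D convolution (no padding): one local feature per frame."""
    k = len(kernel)
    return [sum(kernel[j] * wave[i + j] for j in range(k))
            for i in range(0, len(wave) - k + 1, stride)]

def sru(xs, w, w_f, b_f, w_r, b_r):
    """Single-channel SRU over a sequence of scalar features (toy weights)."""
    # Step 1: terms that depend only on x_t. Nothing here depends on the
    # previous time step, so all frames could be processed in parallel.
    cand = [w * x for x in xs]                      # candidate state
    f = [sigmoid(w_f * x + b_f) for x in xs]        # forget gate
    r = [sigmoid(w_r * x + b_r) for x in xs]        # reset gate
    # Step 2: the only sequential part is a cheap elementwise recurrence.
    c = 0.0
    hs = []
    for x, ct, ft, rt in zip(xs, cand, f, r):
        c = ft * c + (1.0 - ft) * ct
        hs.append(rt * math.tanh(c) + (1.0 - rt) * x)  # highway connection
    return hs

def masked_features(features, hidden):
    """Multiply the feature map by a bounded mask (tanh bound is an assumption)."""
    mask = [math.tanh(h) for h in hidden]
    return [m * f for m, f in zip(mask, features)]

# Toy usage: 8 waveform samples -> 4 frames -> masked feature map.
wave = [0.0, 0.5, -0.5, 1.0, -1.0, 0.25, 0.0, -0.25]
feats = conv1d(wave, kernel=[0.5, 0.5], stride=2)
hidden = sru(feats, w=1.0, w_f=1.0, b_f=0.0, w_r=1.0, b_r=0.0)
enhanced = masked_features(feats, hidden)
```

In an LSTM the gate computations at step t depend on the hidden state from step t-1, so the expensive matrix products cannot be batched over time. In the sketch above only the short loop in step 2 is sequential, which is the property the abstract refers to when it says SRU can be parallelized efficiently. A full model would also need a final stage that maps the masked feature map back to a waveform, which is omitted here.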
