Neural Spatio-Temporal Beamformer for Target Speech Separation

Paper title

Neural Spatio-Temporal Beamformer for Target Speech Separation

Authors

Yong Xu, Meng Yu, Shi-Xiong Zhang, Lianwu Chen, Chao Weng, Jianming Liu, Dong Yu

Abstract

Purely neural network (NN) based speech separation and enhancement methods can achieve good objective scores, but they inevitably introduce nonlinear speech distortions that are harmful to automatic speech recognition (ASR). The minimum variance distortionless response (MVDR) beamformer with NN-predicted masks, on the other hand, significantly reduces speech distortion but has limited noise reduction capability. In this paper, we propose a multi-tap MVDR beamformer with complex-valued masks for speech separation and enhancement. Compared to the state-of-the-art NN-mask-based MVDR beamformer, the multi-tap MVDR beamformer exploits inter-frame correlation in addition to the inter-microphone correlation already used in prior art. Further improvements include replacing the real-valued masks with complex-valued masks and jointly training the complex-mask NN. Evaluation on our multi-modal multi-channel target speech separation and enhancement platform demonstrates that the proposed multi-tap MVDR beamformer improves both ASR accuracy and perceptual speech quality over prior art.
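To make the multi-tap idea concrete, the following is a minimal NumPy sketch for a single frequency bin. It is not the authors' implementation. It assumes the commonly used reference-channel MVDR solution (weights taken from one column of the product of the inverse noise covariance and the speech covariance, normalized by its trace), and it assumes the complex masks are applied per frame before the frames are stacked. The function names (`stack_taps`, `covariance`, `mvdr_weights`, `multi_tap_mvdr`), the diagonal loading constant, and the zero-padding of the first frames are all illustrative choices, and the masks are simply passed in rather than predicted by an NN.

```python
import numpy as np


def stack_taps(Y, num_taps):
    """Stack the current frame with the previous (num_taps - 1) frames.

    Y has shape (M, T): M microphones, T frames, one frequency bin.
    Returns an array of shape (M * num_taps, T). Frames before the
    start of the signal are zero-padded (an assumption of this sketch).
    """
    M, T = Y.shape
    stacked = np.zeros((M * num_taps, T), dtype=complex)
    for tap in range(num_taps):
        # Rows of block `tap` hold Y delayed by `tap` frames.
        stacked[tap * M:(tap + 1) * M, tap:] = Y[:, :T - tap]
    return stacked


def covariance(X):
    """Time-averaged spatial (or spatio-temporal) covariance of X (D, T)."""
    return X @ X.conj().T / X.shape[1]


def mvdr_weights(phi_ss, phi_nn, ref=0, diag_load=1e-6):
    """Reference-channel MVDR weights: (Phi_NN^-1 Phi_SS / trace) u_ref."""
    D = phi_nn.shape[0]
    # Small diagonal loading keeps the linear solve well conditioned.
    phi_nn = phi_nn + diag_load * np.trace(phi_nn).real / D * np.eye(D)
    ratio = np.linalg.solve(phi_nn, phi_ss)  # Phi_NN^-1 Phi_SS
    return ratio[:, ref] / np.trace(ratio)


def multi_tap_mvdr(Y, speech_mask, noise_mask, num_taps=2, ref=0):
    """Beamform one frequency bin using complex masks of shape (T,)."""
    # Complex masks scale and rotate each frame of every channel.
    speech_est = stack_taps(Y * speech_mask, num_taps)
    noise_est = stack_taps(Y * noise_mask, num_taps)
    w = mvdr_weights(covariance(speech_est), covariance(noise_est), ref)
    # Filter-and-sum over microphones and taps: w^H applied to stacked Y.
    return w.conj() @ stack_taps(Y, num_taps)
```

With `num_taps=1` the sketch reduces to a conventional mask-based MVDR beamformer that uses only inter-microphone correlation. Larger values enlarge the covariance matrices to size `M * num_taps`, which is how inter-frame correlation enters the filter.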
