Paper title
ADL-MVDR: All deep learning MVDR beamformer for target speech separation
Paper authors
Paper abstract
Speech separation algorithms are often used to separate the target speech from other interfering sources. However, purely neural-network-based speech separation systems often introduce nonlinear distortion that is harmful to automatic speech recognition (ASR) systems. The conventional mask-based minimum variance distortionless response (MVDR) beamformer can be used to minimize the distortion, but it leaves a high level of residual noise. Furthermore, the matrix operations (e.g., matrix inversion) involved in the conventional MVDR solution are sometimes numerically unstable when jointly trained with neural networks. In this paper, we propose a novel all deep learning MVDR (ADL-MVDR) framework, in which the matrix inversion and eigenvalue decomposition are replaced by two recurrent neural networks (RNNs), to resolve both issues at the same time. The proposed method can greatly reduce the residual noise while keeping the target speech undistorted by leveraging the RNN-predicted frame-wise beamforming weights. The system is evaluated on a Mandarin audio-visual corpus and compared against several state-of-the-art (SOTA) speech separation systems. Experimental results demonstrate the superiority of the proposed method across several objective metrics and in ASR accuracy.
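For orientation, the conventional mask-based MVDR solution that the abstract contrasts against can be sketched in NumPy roughly as follows. This is a minimal illustration of the textbook formulation for a single frequency bin, not the authors' implementation and not the proposed ADL-MVDR (which replaces the eigendecomposition and the inversion with RNNs). The function names `masked_covariance` and `mvdr_weights`, the array shapes, and the diagonal-loading constant `diag_load` are assumptions made for this example.

```python
import numpy as np


def masked_covariance(spec, mask):
    """Mask-weighted spatial covariance for one frequency bin.

    spec: (C, T) complex STFT frames, C channels and T frames (assumed layout).
    mask: (T,) real-valued time-frequency mask for this bin.
    """
    weighted = spec * mask  # broadcast the mask over the channel axis
    return weighted @ spec.conj().T / max(mask.sum(), 1e-8)


def mvdr_weights(phi_ss, phi_nn, diag_load=1e-6):
    """Conventional MVDR weights w = Phi_NN^{-1} v / (v^H Phi_NN^{-1} v).

    phi_ss, phi_nn: (C, C) Hermitian speech and noise covariance matrices.
    diag_load: hypothetical diagonal-loading factor; a common workaround for
               an ill-conditioned Phi_NN, which is the instability the
               abstract mentions.
    """
    n_ch = phi_nn.shape[0]
    # Steering vector estimate: principal eigenvector of the speech covariance.
    _, eigvecs = np.linalg.eigh(phi_ss)  # eigenvalues come back in ascending order
    v = eigvecs[:, -1]
    # Solve a linear system instead of forming the explicit inverse.
    load = diag_load * np.trace(phi_nn).real / n_ch
    phi_nn_inv_v = np.linalg.solve(phi_nn + load * np.eye(n_ch), v)
    return phi_nn_inv_v / (v.conj() @ phi_nn_inv_v)
```

Given weights `w` for a bin, the beamformed output for a multichannel frame `y` is `w.conj() @ y`. In this formulation both `np.linalg.eigh` and `np.linalg.solve` sit inside the computation, so when the covariances come from a jointly trained mask estimator, gradients must pass through these operations.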