Paper Title

Neural Speech Separation Using Spatially Distributed Microphones

Authors

Dongmei Wang, Zhuo Chen, Takuya Yoshioka

Abstract

This paper proposes a neural network based speech separation method using spatially distributed microphones. Unlike in traditional microphone array settings, neither the number of microphones nor their spatial arrangement is known in advance, which hinders the use of conventional multi-channel speech separation neural networks that rely on fixed-size inputs. To overcome this, a novel network architecture is proposed that interleaves inter-channel processing layers and temporal processing layers. The inter-channel processing layers apply a self-attention mechanism along the channel dimension to exploit the information obtained with a varying number of microphones. The temporal processing layers are based on a bidirectional long short-term memory (BLSTM) model and are applied to each channel independently. By stacking these two kinds of layers alternately, the proposed network leverages information across both time and space. The network estimates time-frequency (TF) masks for each speaker, which are then used to generate enhanced speech signals either with TF masking or beamforming. Speech recognition experiments show that the proposed method significantly outperforms baseline multi-channel speech separation systems.
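The key property of the inter-channel processing layer described above is that self-attention along the channel axis works for any number of microphones, since the learned weights act on the feature dimension rather than the channel count. The following is a minimal NumPy sketch of that idea (not the authors' implementation; the feature size, weight shapes, and single-head formulation are illustrative assumptions):

```python
import numpy as np

def inter_channel_attention(x, wq, wk, wv):
    """Single-head self-attention along the channel axis.

    x: (C, T, D) features for C microphone channels, T frames, D features.
    The weight matrices are (D, D), so C may vary freely at inference time.
    Returns an array of shape (C, T, D).
    """
    q = x @ wq  # (C, T, D) queries
    k = x @ wk  # (C, T, D) keys
    v = x @ wv  # (C, T, D) values
    # scores[t, c, k]: similarity of channel c to channel k at frame t
    scores = np.einsum('ctd,ktd->tck', q, k) / np.sqrt(q.shape[-1])
    # softmax over the attended-channel axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # mix channel information back into each channel
    return np.einsum('tck,ktd->ctd', weights, v)

rng = np.random.default_rng(0)
D = 8  # illustrative feature dimension
wq, wk, wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))

# The same weights handle different microphone counts without any reshaping,
# which is the point of attending along the channel dimension.
for C in (2, 4, 7):
    x = rng.standard_normal((C, 5, D))
    y = inter_channel_attention(x, wq, wk, wv)
    assert y.shape == (C, 5, D)
```

In the paper's architecture, layers like this alternate with per-channel BLSTM layers; the sketch covers only the channel-wise attention step, which is the part that makes the network independent of the microphone count.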
