Paper Title
Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation
Paper Authors
Paper Abstract
The dominant speech separation models are based on complex recurrent or convolutional neural networks that model speech sequences with only indirect conditioning on context, for example by passing information through many intermediate states in a recurrent neural network, which leads to suboptimal separation performance. In this paper, we propose a dual-path transformer network (DPTNet) for end-to-end speech separation, which introduces direct context-awareness into the modeling of speech sequences. By introducing an improved transformer, elements in a speech sequence can interact with each other directly, which enables DPTNet to model speech sequences with direct context-awareness. The improved transformer in our approach learns the order information of the speech sequences without positional encodings by incorporating a recurrent neural network into the original transformer. In addition, the dual-path structure makes our model efficient for extremely long speech sequence modeling. Extensive experiments on benchmark datasets show that our approach outperforms the current state of the art (20.6 dB SDR on the public WSJ0-2mix corpus).
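To make the abstract's two ideas concrete, here is a minimal PyTorch sketch: a transformer layer that learns sequence order through a built-in RNN instead of positional encodings, and a dual-path sweep that folds a long sequence into short chunks so each attention call stays cheap. The layer sizes, the choice of a bidirectional LSTM, and the non-overlapping chunking are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class ImprovedTransformerLayer(nn.Module):
    """Transformer layer whose feed-forward part contains an RNN, so order
    information is learned without positional encodings (illustrative sizes)."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, d_hidden: int = 128):
        super().__init__()
        # Self-attention: every frame attends to every other frame directly
        # (the "direct context-awareness" the abstract refers to).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # An RNN inside the feed-forward block supplies sequence order,
        # replacing positional encodings (assumed LSTM; sizes are placeholders).
        self.rnn = nn.LSTM(d_model, d_hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * d_hidden, d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        a, _ = self.attn(x, x, x)            # direct element-to-element interaction
        x = self.norm1(x + a)
        r, _ = self.rnn(x)                   # recurrence encodes sequence order
        return self.norm2(x + self.proj(r))


def dual_path_pass(seq, intra, inter, chunk: int = 100):
    """One intra-chunk + inter-chunk sweep over a long sequence.

    seq: (batch, time, d_model). Folding the sequence into chunks keeps every
    attention call short. Non-overlapping chunks are used here for brevity.
    """
    b, t, d = seq.shape
    pad = (-t) % chunk
    seq = nn.functional.pad(seq, (0, 0, 0, pad))
    n = seq.shape[1] // chunk
    x = seq.view(b, n, chunk, d)
    # Intra-chunk path: model each short chunk independently.
    x = intra(x.reshape(b * n, chunk, d)).view(b, n, chunk, d)
    # Inter-chunk path: model across chunks at each within-chunk position.
    x = x.transpose(1, 2).reshape(b * chunk, n, d)
    x = inter(x).view(b, chunk, n, d).transpose(1, 2)
    return x.reshape(b, n * chunk, d)[:, :t]


if __name__ == "__main__":
    intra, inter = ImprovedTransformerLayer(), ImprovedTransformerLayer()
    mix = torch.randn(1, 1234, 64)           # encoded features of a long mixture
    out = dual_path_pass(mix, intra, inter)
    print(out.shape)                          # torch.Size([1, 1234, 64])
```

In a full separation model, several such intra/inter sweeps would be stacked and followed by a mask-estimation head and an overlap-add decoder; this sketch only illustrates why the dual-path layout keeps self-attention tractable on very long sequences.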