Paper Title
U-Former: Improving Monaural Speech Enhancement with Multi-head Self and Cross Attention
Paper Authors
Paper Abstract
For supervised speech enhancement, contextual information is important for accurate spectral mapping. However, commonly used deep neural networks (DNNs) are limited in capturing long temporal contexts. To leverage long-term contexts for tracking a target speaker, this paper treats speech enhancement as a sequence-to-sequence mapping and proposes a novel Transformer-based monaural speech enhancement U-Net structure, dubbed U-Former. The key idea is to model, through multi-head attention mechanisms, the long-term correlations and dependencies that are crucial for accurately modeling noisy speech. To this end, U-Former incorporates multi-head attention at two levels: 1) multi-head self-attention modules, which compute attention maps along both the time and frequency axes to generate time and frequency sub-attention maps, leveraging global interactions between encoder features; and 2) multi-head cross-attention modules, which are inserted into the skip connections and allow fine-grained recovery in the decoder by filtering out uncorrelated features. Experimental results show that U-Former consistently outperforms recent models in terms of PESQ, STOI, and SSNR scores.
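The abstract's two-axis self-attention can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head (rather than multi-head) attention, the additive fusion of the two sub-attention outputs, and all shapes and function names are assumptions made for illustration. It shows the core idea of attending across time frames per frequency bin and across frequency bins per time frame of a spectrogram feature map.

```python
import numpy as np

def scaled_dot_attention(q, k, v):
    # Standard scaled dot-product attention; q, k, v: (seq_len, d).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                             # (seq_len, d)

def time_freq_self_attention(x):
    # x: (T, F, d) encoder feature map (time frames, frequency bins, channels).
    # Time sub-attention: attend across frames, independently per frequency bin.
    time_att = np.stack(
        [scaled_dot_attention(x[:, f], x[:, f], x[:, f]) for f in range(x.shape[1])],
        axis=1,
    )
    # Frequency sub-attention: attend across bins, independently per time frame.
    freq_att = np.stack(
        [scaled_dot_attention(x[t], x[t], x[t]) for t in range(x.shape[0])],
        axis=0,
    )
    # Additive fusion of the two sub-attention maps (assumed; the paper may
    # combine them differently).
    return time_att + freq_att

T, F, d = 8, 16, 4
y = time_freq_self_attention(np.random.randn(T, F, d))
assert y.shape == (T, F, d)
```

The cross-attention modules in the skip connections would follow the same `scaled_dot_attention` pattern, with queries taken from the decoder features and keys/values from the corresponding encoder features, so that uncorrelated encoder content receives low attention weight.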