Paper Title
U-Former: Improving Monaural Speech Enhancement with Multi-head Self and Cross Attention
Paper Authors
Paper Abstract
For supervised speech enhancement, contextual information is important for accurate spectral mapping. However, commonly used deep neural networks (DNNs) are limited in capturing long temporal contexts. To leverage long-term contexts for tracking a target speaker, this paper treats speech enhancement as a sequence-to-sequence mapping and proposes a novel Transformer-based monaural speech enhancement U-Net structure, dubbed U-Former. The key idea is to model, through multi-head attention mechanisms, the long-term correlations and dependencies that are crucial for accurately modeling noisy speech. To this end, U-Former incorporates multi-head attention at two levels: 1) multi-head self-attention modules, which compute attention maps along both the time and frequency axes to generate time and frequency sub-attention maps, leveraging global interactions between encoder features; and 2) multi-head cross-attention modules, which are inserted into the skip connections and allow fine-grained recovery in the decoder by filtering out uncorrelated features. Experimental results show that U-Former consistently outperforms recent models in terms of PESQ, STOI, and SSNR scores.
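The abstract's two-axis self-attention can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head (rather than multi-head) attention, the additive fusion of the two sub-attention outputs, and all shapes and function names are assumptions made for illustration. It shows the core idea of attending across time frames per frequency bin and across frequency bins per time frame of a spectrogram feature map.

```python
import numpy as np

def scaled_dot_attention(q, k, v):
    # Standard scaled dot-product attention; q, k, v: (seq_len, d).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                             # (seq_len, d)

def time_freq_self_attention(x):
    # x: (T, F, d) encoder feature map (time frames, frequency bins, channels).
    # Time sub-attention: attend across frames, independently per frequency bin.
    time_att = np.stack(
        [scaled_dot_attention(x[:, f], x[:, f], x[:, f]) for f in range(x.shape[1])],
        axis=1,
    )
    # Frequency sub-attention: attend across bins, independently per time frame.
    freq_att = np.stack(
        [scaled_dot_attention(x[t], x[t], x[t]) for t in range(x.shape[0])],
        axis=0,
    )
    # Additive fusion of the two sub-attention maps (assumed; the paper may
    # combine them differently).
    return time_att + freq_att

T, F, d = 8, 16, 4
y = time_freq_self_attention(np.random.randn(T, F, d))
assert y.shape == (T, F, d)
```

The cross-attention modules in the skip connections would follow the same `scaled_dot_attention` pattern, with queries taken from the decoder features and keys/values from the corresponding encoder features, so that uncorrelated encoder content receives low attention weight.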