Time-domain speaker extraction network

Paper title

Time-domain speaker extraction network

Authors

Chenglin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

Abstract

Speaker extraction aims to extract a target speaker's voice from multi-talker speech. It emulates the human cocktail party effect, that is, the ability to listen selectively. Prior work mostly performs speaker extraction in the frequency domain and then reconstructs the signal with an approximated phase. Inaccurate phase estimation is inherent to frequency-domain processing and degrades the quality of signal reconstruction. In this paper, we propose a time-domain speaker extraction network (TseNet) that does not decompose the speech signal into magnitude and phase spectra and therefore does not require phase estimation. TseNet consists of a stack of dilated depthwise separable convolutional networks, which capture the long-range dependency of the speech signal with a manageable number of parameters. It is also conditioned on a reference voice from the target speaker, characterized by a speaker i-vector, so that it listens selectively to the target speaker. Experiments show that the proposed TseNet achieves 16.3% and 7.0% relative improvements over the baseline in terms of signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ), respectively, under the open evaluation condition.
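
The abstract's claim that dilated depthwise separable convolutions reach long-range context with a manageable number of parameters can be illustrated with a small back-of-the-envelope sketch. The channel count, kernel size, and exponentially growing dilation schedule below are hypothetical values chosen for illustration; the abstract does not state the actual TseNet configuration.

```python
def standard_conv1d_params(c_in, c_out, kernel_size):
    """Weight count of a standard 1-D convolution (bias ignored)."""
    return c_in * c_out * kernel_size


def depthwise_separable_conv1d_params(c_in, c_out, kernel_size):
    """Weight count of a depthwise conv followed by a 1x1 pointwise conv (bias ignored)."""
    depthwise = c_in * kernel_size  # one filter per input channel
    pointwise = c_in * c_out        # 1x1 conv that mixes channels
    return depthwise + pointwise


def receptive_field(kernel_size, dilations):
    """Receptive field, in frames, of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical sizes, not taken from the paper.
channels, kernel = 512, 3
dilations = [2 ** i for i in range(8)]  # assumed schedule: 1, 2, 4, ..., 128

print(standard_conv1d_params(channels, channels, kernel))
print(depthwise_separable_conv1d_params(channels, channels, kernel))
print(receptive_field(kernel, dilations))
```

Under these assumed sizes, the separable layer needs roughly a third of the weights of a standard convolution, while the receptive field of the stack grows with the sum of the dilations rather than with the number of layers alone.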
