SpEx+: A Complete Time Domain Speaker Extraction Network

Paper title

SpEx+: A Complete Time Domain Speaker Extraction Network

Authors

Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

Abstract

Speaker extraction aims to extract the target speech signal from a multi-talker environment, given a reference utterance from the target speaker. We recently proposed a time-domain solution, SpEx, that avoids the phase estimation required by frequency-domain approaches. However, SpEx is not a fully time-domain solution: it performs time-domain speech encoding for speaker extraction but takes a frequency-domain speaker embedding as the reference. The analysis window size for the time-domain input also differs from that of the frequency-domain input. This mismatch adversely affects system performance. To eliminate it, we propose a complete time-domain speaker extraction solution called SpEx+. Specifically, we tie the weights of two identical speech encoder networks, one in the encoder-extractor-decoder pipeline and the other as part of the speaker encoder. Experiments on the WSJ0-2mix-extr database show that SpEx+ achieves SDR improvements of 0.8 dB and 2.1 dB over the state-of-the-art SpEx baseline under different-gender and same-gender conditions, respectively.
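
The weight tying described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class `SpeechEncoder`, the function `speaker_embedding`, the filter count, the window and hop sizes, and the mean pooling that stands in for the speaker encoder are all assumptions chosen for illustration. The only point it demonstrates is that a single encoder object, and therefore a single set of weights and a single analysis window, processes both the mixture and the reference speech.

```python
import math
import random


class SpeechEncoder:
    """Toy 1-D convolutional encoder: strided windows -> linear filters -> ReLU."""

    def __init__(self, n_filters=4, win=8, hop=4, seed=0):
        rng = random.Random(seed)
        self.win = win
        self.hop = hop
        # One weight vector per filter; these are the weights that get tied.
        self.filters = [[rng.uniform(-1.0, 1.0) for _ in range(win)]
                        for _ in range(n_filters)]

    def __call__(self, wave):
        # Slice the waveform into overlapping frames of length `win`.
        frames = [wave[i:i + self.win]
                  for i in range(0, len(wave) - self.win + 1, self.hop)]
        # Apply every filter to every frame, followed by a ReLU.
        return [[max(0.0, sum(w * x for w, x in zip(f, frame)))
                 for f in self.filters]
                for frame in frames]


def speaker_embedding(encoder, reference):
    """Hypothetical speaker encoder: mean-pool the encoded reference over time."""
    feats = encoder(reference)
    dim = len(feats[0])
    return [sum(frame[k] for frame in feats) / len(feats) for k in range(dim)]


# A single encoder instance is shared by both branches (tied weights).
shared_encoder = SpeechEncoder()

mixture = [math.sin(0.3 * t) + math.sin(0.7 * t) for t in range(64)]
reference = [math.sin(0.3 * t) for t in range(40)]

mixture_feats = shared_encoder(mixture)                    # extraction branch
embedding = speaker_embedding(shared_encoder, reference)   # speaker branch
```

Because both branches call the same object, the mixture features and the speaker embedding live in the same latent space, and any update to `shared_encoder.filters` affects both at once. In a trainable framework the same effect would typically be obtained by reusing one module instance in both branches.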
