Paper Title
A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments
Paper Authors
Paper Abstract
Speaker verification (SV) has recently attracted considerable research interest due to the growing popularity of virtual assistants. At the same time, SV systems face an increasing demand: they should be robust to short speech segments, especially in noisy and reverberant environments. In this paper, we consider one more important requirement for practical applications: the system should be robust to audio streams containing long non-speech segments, to which no voice activity detection (VAD) has been applied. To meet these two requirements, we introduce feature pyramid module (FPM)-based multi-scale aggregation (MSA) and self-adaptive soft VAD (SAS-VAD). We present FPM-based MSA to deal with short speech segments in noisy and reverberant environments, and we use SAS-VAD to increase robustness to long non-speech segments. To further improve robustness to acoustic distortions (i.e., noise and reverberation), we apply a masking-based speech enhancement (SE) method. We combine the SV, VAD, and SE models in a unified deep learning framework and jointly train the entire network in an end-to-end manner. To the best of our knowledge, this is the first work to combine these three models in a single deep learning framework. We conduct experiments on the Korean indoor (KID) and VoxCeleb datasets, both corrupted by noise and reverberation. The results show that the proposed method is effective for SV under these challenging conditions and performs better than baseline i-vector and deep speaker embedding systems.
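To make the aggregation idea concrete, below is a minimal PyTorch-style sketch of FPM-based multi-scale aggregation. It is an illustration under assumptions, not the authors' exact architecture: the backbone stage widths, the common pyramid width `fpm_dim`, and the use of mean-and-standard-deviation statistics pooling are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPMAggregator(nn.Module):
    """Feature-pyramid-style multi-scale aggregation (illustrative sketch)."""

    def __init__(self, channels=(64, 128, 256), fpm_dim=128, embed_dim=256):
        super().__init__()
        # Lateral 1x1 convs project each backbone stage to a common width.
        self.laterals = nn.ModuleList(
            [nn.Conv1d(c, fpm_dim, kernel_size=1) for c in channels])
        # 3x3 convs smooth each fused map after the top-down pathway.
        self.smooths = nn.ModuleList(
            [nn.Conv1d(fpm_dim, fpm_dim, kernel_size=3, padding=1)
             for _ in channels])
        # Embedding layer over concatenated per-scale statistics (mean + std).
        self.embed = nn.Linear(len(channels) * fpm_dim * 2, embed_dim)

    def forward(self, feats):
        # feats: list of (batch, C_i, T_i) maps, from shallow to deep stages.
        maps = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Top-down pathway: upsample each deeper map and add it to the
        # shallower one, so every scale sees higher-level context.
        for i in range(len(maps) - 2, -1, -1):
            up = F.interpolate(maps[i + 1], size=maps[i].shape[-1],
                               mode="nearest")
            maps[i] = maps[i] + up
        pooled = []
        for smooth, x in zip(self.smooths, maps):
            x = smooth(x)
            # Statistics pooling collapses the variable-length time axis.
            pooled.append(torch.cat([x.mean(dim=-1), x.std(dim=-1)], dim=-1))
        return self.embed(torch.cat(pooled, dim=-1))  # utterance embedding

# Example: three backbone stages with decreasing temporal resolution.
feats = [torch.randn(8, 64, 200), torch.randn(8, 128, 100),
         torch.randn(8, 256, 50)]
embedding = FPMAggregator()(feats)  # -> shape (8, 256)
```

Pooling statistics at every pyramid level, rather than only at the deepest one, is what gives short segments multiple temporal resolutions to draw on.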
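Similarly, the following hedged sketch shows how a masking-based SE front-end and a soft VAD can be chained differentiably: the SE network predicts a [0, 1] time-frequency mask that multiplies the noisy spectrogram, and frame-level speech posteriors weight the frames during pooling instead of hard-dropping them, so gradients flow through all three components for joint end-to-end training. The GRU sizes and the simple weighted-average pooling are assumptions, not the paper's exact SAS-VAD.

```python
import torch
import torch.nn as nn

class MaskingSE(nn.Module):
    """Masking-based speech enhancement: predict and apply a soft mask."""

    def __init__(self, n_freq=257, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_freq, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)

    def forward(self, noisy):            # noisy: (batch, frames, n_freq)
        h, _ = self.rnn(noisy)
        m = torch.sigmoid(self.mask(h))  # soft time-frequency mask in [0, 1]
        return m * noisy                 # enhanced spectrogram

class SoftVAD(nn.Module):
    """Frame-level speech posterior estimator (soft VAD)."""

    def __init__(self, n_freq=257, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_freq, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, feats):            # feats: (batch, frames, n_freq)
        h, _ = self.rnn(feats)
        return torch.sigmoid(self.out(h))  # (batch, frames, 1) posteriors

def soft_vad_pooling(frame_emb, speech_prob, eps=1e-8):
    # Weighted-average pooling: non-speech frames are attenuated rather than
    # removed, so gradients still reach the VAD during joint training.
    w = speech_prob / (speech_prob.sum(dim=1, keepdim=True) + eps)
    return (w * frame_emb).sum(dim=1)    # (batch, emb_dim)

# Example: enhance a noisy magnitude spectrogram, then pool with soft VAD.
noisy = torch.randn(4, 300, 257).abs()
enhanced = MaskingSE()(noisy)
posteriors = SoftVAD()(enhanced)
utt = soft_vad_pooling(enhanced, posteriors)  # -> shape (4, 257)
```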