Multi-stream Convolutional Neural Network with Frequency Selection for Robust Speaker Verification

Paper title

Multi-stream Convolutional Neural Network with Frequency Selection for Robust Speaker Verification

Authors

Wei Yao, Shen Chen, Jiamin Cui, Yaolin Lou

Abstract

Speaker verification aims to determine whether an input utterance belongs to the claimed speaker. Conventionally, such systems are deployed in a single-stream scenario, in which the feature extractor operates over the full frequency range. In this paper, we hypothesize that a machine can learn enough to perform the classification task when listening to only part of the frequency range rather than all of it, a technique we call frequency selection, and we propose a novel multi-stream Convolutional Neural Network (CNN) framework that uses this technique for speaker verification. The proposed framework accommodates diverse temporal embeddings generated from multiple streams to enhance the robustness of acoustic modeling. To diversify the temporal embeddings, we consider feature augmentation with frequency selection: the full frequency band is manually segmented into several sub-bands, and the feature extractor of each stream selects which sub-bands to use as its target frequency domain. Unlike the conventional single-stream solution, in which each utterance is processed only once, this framework processes each utterance with multiple streams in parallel. The input utterance for each stream is pre-processed by a frequency selector within a specified frequency range and post-processed by mean normalization. The normalized temporal embeddings of each stream then flow into a pooling layer to generate fused embeddings. We conduct extensive experiments on the VoxCeleb dataset, and the results demonstrate that the multi-stream CNN significantly outperforms the single-stream baseline, with a 20.53% relative improvement in minimum Decision Cost Function (minDCF).
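
The frequency-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_subbands`, the equal-width split into four sub-bands, the toy spectrogram size, and the three stream configurations are all assumptions made for illustration. The CNN feature extractor, mean normalization, and pooling layer are left out.

```python
import numpy as np

def select_subbands(spec, n_bands, chosen):
    """Keep only the chosen sub-bands of a (freq_bins, frames) spectrogram.

    Hypothetical helper: an equal-width split of the frequency axis is assumed.
    """
    # Split the frequency-bin indices into n_bands contiguous, near-equal groups.
    bands = np.array_split(np.arange(spec.shape[0]), n_bands)
    # Gather the bin indices of the sub-bands this stream listens to.
    keep = np.concatenate([bands[i] for i in sorted(chosen)])
    return spec[keep, :]

# Toy input: 40 frequency bins x 100 frames (made-up sizes).
rng = np.random.default_rng(0)
spec = rng.standard_normal((40, 100))

# Assumed stream configuration: each stream sees a different part of the band.
stream_bands = {"full": [0, 1, 2, 3], "low": [0, 1], "high": [2, 3]}
stream_inputs = {
    name: select_subbands(spec, 4, bands) for name, bands in stream_bands.items()
}

for name, x in stream_inputs.items():
    print(name, x.shape)
```

In a full system, each entry of `stream_inputs` would feed its own feature extractor, and the resulting temporal embeddings would be normalized and fused as the abstract describes.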
