ResNeXt and Res2Net Structures for Speaker Verification

Paper title

ResNeXt and Res2Net Structures for Speaker Verification

Authors

Tianyan Zhou, Yong Zhao, Jian Wu

Abstract

The ResNet-based architecture has been widely adopted to extract speaker embeddings for text-independent speaker verification systems. By introducing residual connections to the CNN and standardizing the residual blocks, the ResNet structure is capable of training deep networks to achieve highly competitive recognition performance. However, when the input feature space becomes more complicated, simply increasing the depth and width of the ResNet network may not fully realize its performance potential. In this paper, we present two extensions of the ResNet architecture, ResNeXt and Res2Net, for speaker verification. Originally proposed for image recognition, ResNeXt and Res2Net introduce two more dimensions, cardinality and scale, in addition to depth and width, to improve the model's representation capacity. By increasing the scale dimension, the Res2Net model can represent multi-scale features with various granularities, which particularly facilitates speaker verification for short utterances. We evaluate our proposed systems on three speaker verification tasks. Experiments on the VoxCeleb test set demonstrate that ResNeXt and Res2Net can significantly outperform the conventional ResNet model. The Res2Net model achieves superior performance, reducing the EER by 18.5% relative. Experiments on two other internal test sets with mismatched conditions further confirm that the ResNeXt and Res2Net architectures generalize to noisy environments and segment length variations.
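
The abstract reports its headline result as a relative reduction in equal error rate (EER). As a reading aid, the following is a minimal sketch of how an EER and a relative reduction can be computed from trial scores. It is not the authors' evaluation code: the function names, the simple threshold sweep, and the toy scores are all assumptions made for illustration, and none of the numbers come from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Illustrative helper (hypothetical name). A trial is accepted when its
    score is greater than or equal to the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate, e.g. 0.185 means 18.5% relative."""
    return (baseline - improved) / baseline


# Toy scores, made up for this example only.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(targets, nontargets)

# Hypothetical baseline/improved EERs, not values from the paper.
toy_gain = relative_reduction(0.25, 0.20)
```

A relative figure such as the 18.5% quoted above is a ratio to the baseline EER, so it cannot be turned into an absolute EER without knowing that baseline.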
