Paper Title
Self-attention encoding and pooling for speaker recognition
Paper Authors
Paper Abstract
The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory, and energy consumption. These limitations motivate researchers to design more efficient deep models. On the other hand, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capability and strong performance on a variety of Natural Language Processing (NLP) applications. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances. SAEP is a stack of identical blocks relying solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification. We have evaluated this approach on both the VoxCeleb1 and VoxCeleb2 datasets. The proposed architecture outperforms the baseline x-vector and shows competitive performance against other convolution-based benchmarks, with a significant reduction in model size. It employs 94%, 95%, and 73% fewer parameters than ResNet-34, ResNet-50, and x-vector, respectively. This indicates that the proposed fully attention-based architecture is more efficient at extracting time-invariant features from speaker utterances.
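The core idea of the abstract, self-attention over speech frames followed by a pooling step that collapses a variable-length utterance into one fixed-size embedding, can be sketched minimally as below. This is an illustrative NumPy sketch, not the paper's actual architecture: the dimensions, random weights, single head, and single block are assumptions, and the real SAEP stacks several blocks with position-wise feed-forward networks.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (frames, dim) -- scaled dot-product self-attention across time frames
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)
    return scores @ V

def attentive_pool(H, w):
    # collapse a variable-length frame sequence into one fixed-size vector
    alpha = softmax(H @ w, axis=0)   # (frames,) attention weights over time
    return alpha @ H                 # weighted average of encoded frames

rng = np.random.default_rng(0)
dim = 8                                             # illustrative feature size
X = rng.standard_normal((300, dim))                 # one utterance: 300 frames
Wq, Wk, Wv = [rng.standard_normal((dim, dim)) for _ in range(3)]

H = self_attention(X, Wq, Wk, Wv)                   # encoder (attention part only)
emb = attentive_pool(H, rng.standard_normal(dim))   # speaker embedding
print(emb.shape)                                    # (8,) regardless of frame count
```

Note that the embedding size depends only on the model dimension, not the number of frames, which is what allows the same network to handle non-fixed-length utterances.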