Title
Text-Independent Speaker Verification with Dual Attention Network
Authors
Abstract
This paper presents a novel design of attention model for text-independent speaker verification. The model takes a pair of input utterances and generates an utterance-level embedding to represent the speaker-specific characteristics in each utterance. The input utterances are expected to have highly similar embeddings if they come from the same speaker. The proposed attention model consists of a self-attention module and a mutual attention module, which jointly contribute to the generation of the utterance-level embedding. The self-attention weights are computed from the utterance itself, while the mutual-attention weights are computed with the involvement of the other utterance in the input pair. As a result, each utterance is represented by a self-attention weighted embedding and a mutual-attention weighted embedding. The similarity between the embeddings is measured by a cosine distance score and a binary classifier output score. The whole model, named Dual Attention Network, is trained end-to-end on the Voxceleb database. The evaluation results on the Voxceleb 1 test set show that the Dual Attention Network significantly outperforms the baseline systems. The best result yields an equal error rate of 1.6%.
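
As an illustration of the mechanism described in the abstract, the following is a minimal PyTorch sketch of dual attention pooling over frame-level features. It is not the authors' implementation: the frame-level encoder is omitted, and the linear layers (self_att, query), the single-head attention form, and the summing of the two embeddings before cosine scoring are all assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionPooling(nn.Module):
    """Sketch of self- and mutual-attention pooling for an utterance pair.
    Assumes each utterance is already encoded as (T, D) frame features."""
    def __init__(self, dim):
        super().__init__()
        self.self_att = nn.Linear(dim, 1)  # scores each frame from the utterance itself
        self.query = nn.Linear(dim, dim)   # hypothetical projection for mutual attention

    def self_embed(self, x):
        # x: (T, D); weights depend only on the utterance itself
        w = torch.softmax(self.self_att(x), dim=0)     # (T, 1)
        return (w * x).sum(dim=0)                      # (D,) self-attention weighted embedding

    def mutual_embed(self, x, other_emb):
        # weights for x involve an embedding of the other utterance in the pair
        scores = x @ self.query(other_emb)             # (T,)
        w = torch.softmax(scores, dim=0).unsqueeze(-1) # (T, 1)
        return (w * x).sum(dim=0)                      # (D,) mutual-attention weighted embedding

    def forward(self, xa, xb):
        ea_s, eb_s = self.self_embed(xa), self.self_embed(xb)
        ea_m = self.mutual_embed(xa, eb_s)             # xa attended with respect to xb
        eb_m = self.mutual_embed(xb, ea_s)             # xb attended with respect to xa
        return ea_s, ea_m, eb_s, eb_m

# Cosine scoring of the pair (combining the two embeddings by summation
# is an assumption; the abstract does not specify the combination rule).
pool = DualAttentionPooling(dim=256)
xa, xb = torch.randn(120, 256), torch.randn(95, 256)  # two utterances of different lengths
ea_s, ea_m, eb_s, eb_m = pool(xa, xb)
cos_score = F.cosine_similarity(ea_s + ea_m, eb_s + eb_m, dim=0)

The binary classifier output score mentioned in the abstract could, under the same assumptions, come from a small feed-forward network applied to the pair of pooled embeddings; its exact form is not specified here.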