Attention-based conditioning methods using variable frame rate for style-robust speaker verification

Paper title

Attention-based conditioning methods using variable frame rate for style-robust speaker verification

Authors

Amber Afshan, Abeer Alwan

Abstract

We propose an approach to extract speaker embeddings that are robust to speaking style variations in text-independent speaker verification. Typically, speaker embedding extraction includes training a DNN for speaker classification and using the bottleneck features as speaker representations. Such a network has a pooling layer to transform frame-level to utterance-level features by calculating statistics over all utterance frames, with equal weighting. However, self-attentive embeddings perform weighted pooling such that the weights correspond to the importance of the frames in a speaker classification task. Entropy can capture acoustic variability due to speaking style variations. Hence, an entropy-based variable frame rate vector is proposed as an external conditioning vector for the self-attention layer to provide the network with information that can address style effects. This work explores five different approaches to conditioning. The best conditioning approach, concatenation with gating, provided statistically significant improvements over the x-vector baseline in 12/23 tasks and was the same as the baseline in 11/23 tasks when using the UCLA speaker variability database. It also significantly outperformed self-attention without conditioning in 9/23 tasks and was worse in 1/23. The method also showed significant improvements in multi-speaker scenarios of SITW.
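To make the pooling step concrete, the following is a minimal sketch of the contrast the abstract draws between equal-weight statistics pooling and attention-weighted pooling with an external per-frame conditioning value. The abstract does not specify the architecture, so everything here is an assumption made for illustration: the function names, the single linear scoring vector `v`, the scalar conditioning value per frame (standing in for the entropy-based variable frame rate vector), and the sigmoid gate applied before concatenation. It should not be read as the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_stats_pooling(frames, weights):
    """Concatenate the weighted mean and weighted std over all frames.

    frames:  list of T frame-level feature vectors (each a list of floats)
    weights: list of T non-negative weights that sum to 1
    """
    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    var = [
        sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, frames))
        for d in range(dim)
    ]
    return mean + [math.sqrt(x) for x in var]


def stats_pooling(frames):
    """Equal-weight statistics pooling, as in a plain x-vector style network."""
    n = len(frames)
    return weighted_stats_pooling(frames, [1.0 / n] * n)


def conditioned_attention_weights(frames, cond, v, gate_w):
    """Attention weights from frames concatenated with a gated condition.

    cond:   one scalar per frame (hypothetical stand-in for the VFR vector)
    v:      scoring vector of length dim + 1 (hypothetical learned parameter)
    gate_w: scalar gate parameter (hypothetical); the sigmoid gate scales
            how much of the conditioning value reaches the attention scorer
    """
    scores = []
    for f, c in zip(frames, cond):
        gate = 1.0 / (1.0 + math.exp(-gate_w * c))
        x = f + [gate * c]  # concatenation of frame features and gated condition
        scores.append(sum(vi * xi for vi, xi in zip(v, x)))
    return softmax(scores)


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cond = [0.1, 0.9, 0.5]
weights = conditioned_attention_weights(frames, cond, [0.2, -0.1, 1.5], 2.0)
embedding = weighted_stats_pooling(frames, weights)
```

With an all-zero scoring vector the attention weights become uniform and the result reduces to equal-weight statistics pooling, which is a convenient sanity check for a sketch like this.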
