Paper Title

End-to-End Trainable Self-Attentive Shallow Network for Text-Independent Speaker Verification

Authors

Hyeonmook Park, Jungbae Park, Sang Wan Lee

Abstract

The generalized end-to-end (GE2E) model is widely used in the speaker verification (SV) field due to its expandability and generality regardless of specific languages. However, the long short-term memory (LSTM) based GE2E model has two limitations: first, the embedding of GE2E suffers from vanishing gradients, which leads to performance degradation for very long input sequences; second, utterances are not represented as properly fixed-dimensional vectors. In this paper, to overcome the issues mentioned above, we propose a novel framework for SV, the end-to-end trainable self-attentive shallow network (SASN), incorporating a time-delay neural network (TDNN) and a self-attentive pooling mechanism based on the self-attentive x-vector system during the utterance embedding phase. We demonstrate that the proposed model is highly efficient and provides more accurate speaker verification than GE2E. On the VCTK dataset, with less than half the size of GE2E, the proposed model showed significant performance improvements over GE2E of about 63%, 67%, and 85% in EER (equal error rate), DCF (detection cost function), and AUC (area under the curve), respectively. Notably, when the input length becomes longer, the DCF score improvement of the proposed model is about 17 times greater than that of GE2E.
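The self-attentive pooling stage summarized in the abstract can be sketched as follows: frame-level features from the TDNN are scored by a small attention network, the scores are normalized with a softmax, and the frames are averaged with those weights to yield a fixed-dimensional utterance embedding regardless of input length. This is a minimal illustrative sketch, not the authors' implementation; the parameter names `W`, `b`, and `v` are hypothetical stand-ins for the learned attention parameters.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scalar scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attentive_pool(frames, W, b, v):
    """Collapse a variable-length sequence of frame features into one
    fixed D-dimensional vector via self-attentive pooling.

    frames: list of T frame vectors, each of length D
    W:      D_a x D attention projection (toy stand-in, normally learned)
    b:      length-D_a bias
    v:      length-D_a scoring vector
    """
    scores = []
    for h in frames:
        # hidden = tanh(W h + b), the attention network's nonlinearity
        hidden = [math.tanh(sum(W[i][j] * h[j] for j in range(len(h))) + b[i])
                  for i in range(len(b))]
        # scalar attention score e_t = v . hidden
        scores.append(sum(v[i] * hidden[i] for i in range(len(v))))
    alphas = softmax(scores)  # attention weights over the T frames
    D = len(frames[0])
    # weighted sum over time -> fixed D-dimensional utterance embedding
    return [sum(alphas[t] * frames[t][d] for t in range(len(frames)))
            for d in range(D)]
```

Because the softmax weights sum to one, the output dimension depends only on the frame feature size, which is exactly the "properly fixed dimensional vector" property the abstract contrasts with the LSTM-based GE2E embedding.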
