Paper Title
Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR
Paper Authors
Paper Abstract
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic features or to the hidden layer activations of an E2E neural network model. The E2E ASR system is based on the joint connectionist temporal classification and attention-based encoder-decoder architecture. M-vector and i-vector results are compared for inserting them at different layers of the encoder neural network using the WSJ and TED-LIUM2 ASR benchmarks. We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vectors for single speaker utterances and significantly lower WERs for utterances in which there are speaker changes.
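The core mechanism the abstract describes (a memory bank of training-speaker i-vectors, read via attention to form an M-vector that is concatenated to the encoder activations) can be sketched as follows. This is a minimal illustrative sketch with assumed names and dot-product attention; the paper's actual projection and scoring details may differ.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

def read_m_vector(hidden, memory, W_q):
    """Attention read over a bank of speaker i-vectors (illustrative sketch).

    hidden : (H,)   encoder hidden activation (or acoustic feature)
    memory : (N, D) i-vectors of N training speakers
    W_q    : (D, H) assumed query projection into i-vector space
    Returns the hidden vector concatenated with the M-vector, shape (H + D,).
    """
    query = W_q @ hidden          # project hidden state to i-vector space
    scores = memory @ query       # one attention score per stored i-vector
    weights = softmax(scores)     # attention distribution over training speakers
    m_vec = weights @ memory      # weighted sum of i-vectors = M-vector
    return np.concatenate([hidden, m_vec])

rng = np.random.default_rng(0)
H, D, N = 8, 4, 5                 # hidden dim, i-vector dim, number of speakers
out = read_m_vector(rng.normal(size=H),
                    rng.normal(size=(N, D)),
                    rng.normal(size=(D, H)))
print(out.shape)                  # concatenated vector of size H + D
```

Because the attention read depends only on the current hidden state and the stored training i-vectors, no auxiliary i-vector extractor is needed at test time, and the weights can shift within an utterance when the speaker changes.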