Paper Title


Designing Neural Speaker Embeddings with Meta Learning

Authors

Manoj Kumar, Tae Jin Park, Somer Bishop, Shrikanth Narayanan

Abstract


Neural speaker embeddings trained using classification objectives have demonstrated state-of-the-art performance in multiple applications. Typically, such embeddings are trained on an out-of-domain corpus on a single task e.g., speaker classification, albeit with a large number of classes (speakers). In this work, we reformulate embedding training under the meta-learning paradigm. We redistribute the training corpus as an ensemble of multiple related speaker classification tasks, and learn a representation that generalizes better to unseen speakers. First, we develop an open source toolkit to train x-vectors that is matched in performance with pre-trained Kaldi models for speaker diarization and speaker verification applications. We find that different bottleneck layers in the architecture variedly favor different applications. Next, we use two meta-learning strategies, namely prototypical networks and relation networks, to improve over the x-vector embeddings. Our best performing model achieves a relative improvement of 12.37% and 7.11% in speaker error on the DIHARD II development corpus and the AMI meeting corpus, respectively. We analyze improvements across different domains in the DIHARD corpus. Notably, on the challenging child speech domain, we study the relation between child age and the diarization performance. Further, we show reductions in equal error rate for speaker verification on the SITW corpus (7.68%) and the VOiCES challenge corpus (8.78%). We observe that meta-learning particularly offers benefits in challenging acoustic conditions and recording setups encountered in these corpora. Our experiments illustrate the applicability of meta-learning as a generalized learning paradigm for training deep neural speaker embeddings.
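The abstract's core idea of redistributing the training corpus into an ensemble of speaker classification tasks can be illustrated with the episodic loss of a prototypical network, one of the two meta-learning strategies the paper uses. The sketch below is an illustrative assumption, not the authors' implementation: in the actual system the embeddings would be produced by an x-vector network and the loss backpropagated through it, whereas here plain NumPy arrays stand in for the embeddings, and the function name and shapes are hypothetical.

```python
import numpy as np

def prototypical_episode_loss(support, support_labels, query, query_labels):
    """Cross-entropy over negative squared distances to class prototypes.

    support: (n_support, dim) embeddings; support_labels: (n_support,)
    query:   (n_query, dim)   embeddings; query_labels:   (n_query,)
    Each episode samples a small set of speakers; the speakers seen at
    test time need not appear in training, which is the generalization
    the meta-learning reformulation targets.
    """
    classes = np.unique(support_labels)
    # One prototype per speaker: the mean of its support embeddings.
    prototypes = np.stack([support[support_labels == c].mean(axis=0)
                           for c in classes])
    # Negative squared Euclidean distance from each query to each prototype.
    diffs = query[:, None, :] - prototypes[None, :, :]
    logits = -np.sum(diffs ** 2, axis=-1)           # (n_query, n_classes)
    # Numerically stable softmax cross-entropy against the true speaker.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.searchsorted(classes, query_labels)    # map labels to columns
    return -log_probs[np.arange(len(query)), idx].mean()
```

Because classification reduces to nearest-prototype matching in embedding space, the loss directly rewards embeddings that cluster by speaker, which is the property diarization and verification both rely on.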
