Paper Title

Improved acoustic word embeddings for zero-resource languages using multilingual transfer

Authors

Herman Kamper, Yevgen Matusevych, Sharon Goldwater

Abstract

Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. Such embeddings can form the basis for speech search, indexing and discovery systems when conventional speech recognition is not possible. In zero-resource settings where unlabelled speech is the only available resource, we need a method that gives robust embeddings on an arbitrary language. Here we explore multilingual transfer: we train a single supervised embedding model on labelled data from multiple well-resourced languages and then apply it to unseen zero-resource languages. We consider three multilingual recurrent neural network (RNN) models: a classifier trained on the joint vocabularies of all training languages; a Siamese RNN trained to discriminate between same and different words from multiple languages; and a correspondence autoencoder (CAE) RNN trained to reconstruct word pairs. In a word discrimination task on six target languages, all of these models outperform state-of-the-art unsupervised models trained on the zero-resource languages themselves, giving relative improvements of more than 30% in average precision. When using only a few training languages, the multilingual CAE performs better, but with more training languages the other multilingual models perform similarly. Using more training languages is generally beneficial, but improvements are marginal on some languages. We present probing experiments which show that the CAE encodes more phonetic, word duration, language identity and speaker information than the other multilingual models.
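To make the third model concrete, below is a minimal PyTorch sketch of a correspondence autoencoder (CAE) RNN: an encoder RNN consumes the acoustic features of one spoken instance of a word, its final hidden state is projected to a fixed-dimensional embedding, and a decoder RNN reconstructs the features of a different instance of the same word. The layer sizes and feature dimensionality here (13-dimensional MFCC-style inputs, 130-dimensional embeddings) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CAERNN(nn.Module):
    """Correspondence autoencoder RNN (sketch): encode one spoken
    instance of a word and reconstruct the features of a *different*
    instance of the same word. The projected final encoder state is
    used as the acoustic word embedding."""

    def __init__(self, feat_dim=13, hidden_dim=400, embed_dim=130):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.to_embedding = nn.Linear(hidden_dim, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_features = nn.Linear(hidden_dim, feat_dim)

    def embed(self, x):
        # x: (batch, frames, feat_dim) acoustic features, e.g. MFCCs
        _, h = self.encoder(x)            # h: (1, batch, hidden_dim)
        return self.to_embedding(h[-1])   # (batch, embed_dim)

    def forward(self, x, target_len):
        z = self.embed(x)
        # Condition the decoder on the embedding at every output frame
        z_rep = z.unsqueeze(1).repeat(1, target_len, 1)
        out, _ = self.decoder(z_rep)
        return self.to_features(out)      # reconstructed features

# Usage: x and y are two instances of the same word types
model = CAERNN()
x = torch.randn(8, 60, 13)
y = torch.randn(8, 55, 13)
recon = model(x, target_len=y.size(1))
loss = nn.functional.mse_loss(recon, y)   # train to reconstruct the pair
emb = model.embed(x)                      # fixed-dimensional embeddings
```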
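The evaluation referred to in the abstract is the same-different word discrimination task: every pair of test segments is ranked by the distance between their embeddings, and average precision (AP) measures how well same-word pairs rank above different-word pairs. A compact sketch follows, assuming NumPy, SciPy, and scikit-learn, and using cosine distance as a common choice for this task rather than necessarily the paper's exact setup.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics import average_precision_score

def same_different_ap(embeddings, labels):
    """Rank all segment pairs by cosine distance and compute average
    precision, where a pair is positive iff both segments are
    instances of the same word type."""
    dists = pdist(embeddings, metric="cosine")
    labels = np.asarray(labels)
    n = len(labels)
    # pdist's condensed order matches the upper triangle, row by row
    same = np.array([labels[i] == labels[j]
                     for i in range(n) for j in range(i + 1, n)])
    # Smaller distance should mean "same word", so score = -distance
    return average_precision_score(same, -dists)
```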
