Paper Title

Towards Relevance and Sequence Modeling in Language Recognition

Authors

Bharat Padi, Anand Mohan, Sriram Ganapathy

Abstract

The task of automatic language identification (LID) involving multiple dialects of the same language family in the presence of noise is a challenging problem. In these scenarios, the identity of the language/dialect may be reliably present only in parts of the temporal sequence of the speech signal. Conventional approaches to LID (and to speaker recognition) ignore this sequence information by extracting a long-term statistical summary of the recording, assuming independence of the feature frames. In this paper, we propose a neural network framework that utilizes short-sequence information for language recognition. In particular, a new model is proposed for incorporating relevance into language recognition, where parts of the speech data are weighted more heavily based on their relevance to the language recognition task. This relevance weighting is achieved using a bidirectional long short-term memory (BLSTM) network with attention modeling. We explore two approaches: the first uses segment-level i-vector/x-vector representations that are aggregated in the neural model, while the second models the acoustic features directly in an end-to-end neural model. Experiments are performed on the language recognition task of the NIST LRE 2017 Challenge, using clean, noisy, and multi-speaker speech data, as well as on the RATS language recognition corpus. In these experiments on the noisy LRE tasks and the RATS dataset, the proposed approach yields significant improvements over conventional i-vector/x-vector based language recognition approaches as well as over other previous models that incorporate sequence information.
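To make the relevance-weighting idea concrete, below is a minimal sketch (assuming PyTorch) of attention-based pooling over BLSTM outputs: each time step receives a scalar relevance score, and the weighted outputs are pooled into a single utterance-level embedding for language classification. The class name, layer sizes, attention form, and the output dimension here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttentiveBLSTM(nn.Module):
    """Sketch of BLSTM + attention pooling for language recognition.

    All hyperparameters are illustrative assumptions. The input can be
    frame-level acoustic features or a sequence of segment-level
    i-vector/x-vector representations, as in the two approaches the
    abstract describes.
    """
    def __init__(self, feat_dim=80, hidden=256, num_langs=14):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)
        # One scalar relevance score per time step.
        self.attn = nn.Linear(2 * hidden, 1)
        self.classifier = nn.Linear(2 * hidden, num_langs)

    def forward(self, x):
        # x: (batch, time, feat_dim)
        h, _ = self.blstm(x)                    # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # relevance weights over time
        pooled = (w * h).sum(dim=1)             # attention-weighted pooling
        return self.classifier(pooled)          # language logits

model = AttentiveBLSTM()
logits = model(torch.randn(2, 300, 80))  # 2 utterances, 300 frames each
print(logits.shape)                      # torch.Size([2, 14])
```

The key design point is that the softmax over the time axis lets the model down-weight frames that carry little language/dialect evidence (e.g., noise or silence) before the summary statistic is formed, rather than averaging all frames uniformly as conventional i-vector/x-vector pooling does.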
