Paper Title


Using Synthetic Audio to Improve The Recognition of Out-Of-Vocabulary Words in End-To-End ASR Systems

Authors

Xianrui Zheng, Yulan Liu, Deniz Gunceler, Daniel Willett

Abstract


Today, many state-of-the-art automatic speech recognition (ASR) systems apply all-neural models that map audio to word sequences, trained end-to-end along one global optimisation criterion in a fully data-driven fashion. These models allow high-precision ASR for domains and words represented in the training material but have difficulties recognising words that are rarely or not at all represented during training, e.g. trending words and new named entities. In this paper, we use a text-to-speech (TTS) engine to provide synthetic audio for out-of-vocabulary (OOV) words. We aim to boost the recognition accuracy of a recurrent neural network transducer (RNN-T) on OOV words by using the extra audio-text pairs, while maintaining the performance on the non-OOV words. Different regularisation techniques are explored, and the best performance is achieved by fine-tuning the RNN-T on both the original training data and the extra synthetic data with elastic weight consolidation (EWC) applied to the encoder. This yields a 57% relative word error rate (WER) reduction on utterances containing OOV words, without any degradation on the whole test set.
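The key regularisation technique named in the abstract, elastic weight consolidation, adds a quadratic penalty that discourages each encoder weight from drifting away from its pre-fine-tuning value, weighted by an estimate of that weight's (diagonal) Fisher information. A minimal sketch of the penalty term, assuming parameters are represented as flat lists per layer (the function name and data layout are illustrative, not from the paper):

```python
def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params        -- current fine-tuned weights, one flat list per layer
    anchor_params -- weights theta* from before fine-tuning
    fisher        -- diagonal Fisher information estimates F_i per weight
    lam           -- strength of the consolidation term
    """
    total = 0.0
    for p_layer, a_layer, f_layer in zip(params, anchor_params, fisher):
        for p, a, f in zip(p_layer, a_layer, f_layer):
            # weights with high Fisher information are held close to theta*
            total += f * (p - a) ** 2
    return 0.5 * lam * total


# Toy usage: one layer with two weights; only the first weight has moved.
anchor = [[0.0, 2.0]]
current = [[1.0, 2.0]]
fisher = [[2.0, 1.0]]
print(ewc_penalty(current, anchor, fisher))  # 0.5 * 2.0 * 1.0^2 = 1.0
```

During fine-tuning on the TTS-augmented data, this penalty would be added to the RNN-T training loss; the paper applies it only to the encoder parameters, leaving the prediction network and joiner unconstrained.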
