Title
Class LM and word mapping for contextual biasing in End-to-End ASR
Authors
Abstract
In recent years, all-neural, end-to-end (E2E) ASR systems have gained rapid interest in the speech recognition community. They convert speech input to text units with a single trainable neural network model. In ASR, many utterances contain rich named entities. Such named entities may be user- or location-specific, and they are not seen during training. A single model makes it inflexible to utilize dynamic contextual information during inference. In this paper, we propose to train a context-aware E2E model and allow the beam search to traverse into the context FST during inference. We also propose a simple method to adjust the cost discrepancy between the context FST and the base model. This algorithm is able to reduce the WER on named entity utterances by 57% with little accuracy degradation on regular utterances. Although an E2E model does not need a pronunciation dictionary, it is interesting to make use of existing pronunciation knowledge to improve accuracy. We therefore propose an algorithm that maps rare entity words to common words via pronunciation and treats the mapped words as alternative forms of the original words during recognition. This algorithm further reduces the WER on named entity utterances by another 31%.
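The contextual-biasing idea described in the abstract can be sketched with a prefix trie standing in for the context FST: hypotheses whose tokens complete a bias phrase receive a cost reduction during beam search. This is a minimal illustration only; the `BiasTrie` class, the per-token bonus, and the best-match scoring are assumptions, not the paper's actual FST construction or cost-adjustment method.

```python
# Sketch of contextual biasing with a prefix trie of bias phrases.
# All names and the bonus scheme are illustrative assumptions.

class BiasTrie:
    def __init__(self, phrases, bonus=2.0):
        self.root = {}
        self.bonus = bonus  # cost reduction per matched token
        for phrase in phrases:
            node = self.root
            for tok in phrase.split():
                node = node.setdefault(tok, {})
            node["$"] = True  # end-of-phrase marker

def rescore(trie, hyp_tokens, base_cost):
    """Subtract a bonus where a complete bias phrase ends in the
    hypothesis; partial matches earn nothing, so a failed prefix
    never keeps a boost (a crude form of cost-discrepancy control)."""
    best = 0.0
    for start in range(len(hyp_tokens)):
        node, credit = trie.root, 0.0
        for tok in hyp_tokens[start:]:
            if tok not in node:
                break
            node = node[tok]
            credit += trie.bonus
            if "$" in node:  # a full bias phrase just completed
                best = max(best, credit)
    return base_cost - best
```

For example, with `BiasTrie(["john smith"])`, the hypothesis "call john smith" is rewarded while "call john doe" keeps its original cost, since the partial match on "john" alone carries no boost.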
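The second technique, mapping rare entity words to common words via pronunciation, can be sketched as a lookup over a pronunciation lexicon: a rare word is mapped to a common word with an identical phoneme sequence, which recognition can then treat as an alternative surface form. The toy lexicon, the ARPAbet-style symbols, and the exact-match criterion below are made-up illustrations; the paper's actual mapping rules are not reproduced here.

```python
# Sketch of pronunciation-based word mapping.
# Lexicon entries and the matching rule are illustrative assumptions.

def map_rare_to_common(word, lexicon, common_words):
    """Return a common word whose pronunciation exactly matches
    `word`'s pronunciation, or None if no such word exists."""
    pron = lexicon.get(word)
    if pron is None:
        return None  # no pronunciation known for this word
    for cand, cand_pron in lexicon.items():
        if cand != word and cand in common_words and cand_pron == pron:
            return cand
    return None

# Toy lexicon: word -> phoneme sequence (illustrative ARPAbet-style symbols).
LEXICON = {
    "kaity": ["K", "EY", "T", "IY"],
    "katie": ["K", "EY", "T", "IY"],
    "kite":  ["K", "AY", "T"],
}
```

Here the rare spelling "kaity" maps to the common word "katie" because their phoneme sequences match, so the decoder could accept either form for the same entity.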