Bayesian Neural Network Language Modeling for Speech Recognition

Paper Title

Bayesian Neural Network Language Modeling for Speech Recognition

Authors

Boyang Xue, Shoukang Hu, Junhao Xu, Mengzhe Geng, Xunying Liu, Helen Meng

Abstract

State-of-the-art neural network language models (NNLMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex. They are prone to overfitting and poor generalization when given limited training data. To this end, an overarching full Bayesian learning framework encompassing three methods is proposed in this paper to account for the underlying uncertainty in LSTM-RNN and Transformer LMs. The uncertainty over their model parameters, choice of neural activations, and hidden output representations is modeled using Bayesian, Gaussian Process, and variational LSTM-RNN or Transformer LMs, respectively. Efficient inference approaches are used, with neural architecture search automatically selecting the optimal network internal components to be Bayesian learned. A minimal number of Monte Carlo parameter samples, as low as one, is also used. Together these allow the computational costs incurred in Bayesian NNLM training and evaluation to be minimized. Experiments are conducted on two tasks: AMI meeting transcription and Oxford-BBC LipReading Sentences 2 (LRS2) overlapped speech recognition, using state-of-the-art LF-MMI trained factored TDNN systems featuring data augmentation, speaker adaptation, and audio-visual multi-channel beamforming for overlapped speech. Consistent performance improvements over the baseline LSTM-RNN and Transformer LMs with point-estimated model parameters and dropout regularization are obtained across both tasks in terms of perplexity and word error rate (WER). In particular, on the LRS2 data, statistically significant WER reductions of up to 1.3% and 1.2% absolute (12.1% and 11.3% relative) are obtained over the baseline LSTM-RNN and Transformer LMs, respectively, after model combination between the Bayesian NNLMs and their respective baselines.
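
As a rough illustration of what "Bayesian learned" parameters with a single Monte Carlo sample can look like, the sketch below treats the weights of one linear layer as Gaussian random variables and draws them by reparameterization. This is not the authors' implementation. The diagonal Gaussian posterior, the standard normal prior, and all names (`sample_weights`, `kl_to_standard_normal`, `bayesian_linear`) are assumptions made purely for illustration.

```python
import math
import random


def sample_weights(mu, log_sigma, rng):
    """Draw one weight sample W = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [[m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu_row, s_row)]
            for mu_row, s_row in zip(mu, log_sigma)]


def kl_to_standard_normal(mu, log_sigma):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over all weights (assumed prior)."""
    total = 0.0
    for mu_row, s_row in zip(mu, log_sigma):
        for m, s in zip(mu_row, s_row):
            total += 0.5 * (math.exp(2.0 * s) + m * m - 1.0 - 2.0 * s)
    return total


def matvec(x, w):
    """Compute x @ w for a vector x (n_in) and a matrix w (n_in x n_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def bayesian_linear(x, mu, log_sigma, rng, n_samples=1):
    """Average the layer output over n_samples weight draws (1 is the cheapest)."""
    acc = [0.0] * len(mu[0])
    for _ in range(n_samples):
        y = matvec(x, sample_weights(mu, log_sigma, rng))
        acc = [a + v for a, v in zip(acc, y)]
    return [a / n_samples for a in acc]


rng = random.Random(0)
mu = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]]   # posterior means (hypothetical values)
log_sigma = [[-3.0, -3.0] for _ in mu]        # posterior log standard deviations
x = [1.0, 2.0, -1.0]

y = bayesian_linear(x, mu, log_sigma, rng, n_samples=1)
kl = kl_to_standard_normal(mu, log_sigma)     # regularizer added to the training loss
```

With `n_samples=1` each forward pass costs about the same as a point-estimated layer, which is the kind of saving the abstract refers to. The KL term plays the role of a regularizer that pulls the posterior toward the assumed prior.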
