Paper Title

Research on Modeling Units of Transformer Transducer for Mandarin Speech Recognition

Authors

Li Fu, Xiaoxiao Li, Libo Zi

Abstract

Modeling unit and model architecture are two key factors of the Recurrent Neural Network Transducer (RNN-T) in end-to-end speech recognition. To improve the performance of RNN-T on the Mandarin speech recognition task, a novel transformer transducer with a combined architecture of self-attention transformer and RNN is proposed. The choice of modeling units for the transformer transducer is then explored. In addition, we present a new mix-bandwidth training method to obtain a general model that can accurately recognize Mandarin speech at different sampling rates. All of our experiments are conducted on about 12,000 hours of Mandarin speech sampled at 8 kHz and 16 kHz. Experimental results show that the Mandarin transformer transducer using syllable with tone achieves the best performance. It yields an average relative Word Error Rate (WER) reduction of 14.4% and 44.1% compared with the models using syllable initial/final with tone and Chinese characters, respectively. It also outperforms the model based on syllable initial/final with tone with an average relative Character Error Rate (CER) reduction of 13.5%.
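To make the three modeling-unit inventories compared in the abstract concrete, the following is a minimal sketch, not the authors' code, of how a Mandarin transcript could be mapped to Chinese characters, syllables with tone, and syllable initials/finals with tone. It assumes the third-party pypinyin library as a stand-in grapheme-to-pinyin converter; the example text and the non-strict initial/final split are illustrative choices only.

```python
# Sketch: deriving the three candidate modeling-unit sequences from a transcript.
# pypinyin is an assumed tool for illustration; the paper does not specify one.
from pypinyin import lazy_pinyin, Style

text = "语音识别"  # example transcript: "speech recognition"

# 1) Chinese characters as modeling units
chars = list(text)                                    # ['语', '音', '识', '别']

# 2) Syllable with tone (tone number appended), e.g. yu3 yin1 shi2 bie2
syllables = lazy_pinyin(text, style=Style.TONE3)

# 3) Syllable initial/final with tone; strict=False keeps y/w as initials
initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)
finals = lazy_pinyin(text, style=Style.FINALS_TONE3, strict=False)
init_finals = [u for pair in zip(initials, finals) for u in pair if u]

print("characters:    ", chars)
print("syllable+tone: ", syllables)
print("initial/final: ", init_finals)
```

Under this sketch, the syllable-with-tone inventory (roughly 1,300 units) sits between the small initial/final set and the much larger Chinese-character vocabulary, which is the trade-off the paper's experiments evaluate.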
