Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition

Paper Title

Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition

Authors

Guodong Ma, Pengfei Hu, Jian Kang, Shen Huang, Hao Huang

Abstract

In Uyghur speech, consonant and vowel reduction is common, especially in spontaneous speech at a high speaking rate, and it degrades speech recognition performance. To address this problem, we propose an effective phone mask training method for Conformer-based Uyghur end-to-end (E2E) speech recognition. The idea is to randomly mask a certain percentage of phone features during model training, which simulates the reduction phenomena described above and encourages the E2E model to learn more contextual information. Experiments show that this greatly alleviates the problem. In addition, we investigate different masking units in depth, and the results show the effectiveness of our proposed masking unit. We further study the masking method and optimize the filling strategy of the phone mask. Finally, compared with a Conformer-based E2E baseline without mask training, our model achieves a relative Word Error Rate (WER) reduction of about 5.51% on read speech and 12.92% on spontaneous speech. The approach has also been verified on the test set of the open-source THUYG-20 data, where it shows a 20% relative improvement.
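The masking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `phone_mask`, the `mask_ratio` default, the two fill strategies (`"zero"` and `"mean"`), and the assumption that phone boundaries come from a forced alignment as `(start, end)` frame indices are all hypothetical choices made here for illustration. The sketch uses only the Python standard library so that it runs as written; the same logic would apply to array or tensor features.

```python
import random


def phone_mask(features, segments, mask_ratio=0.15, fill="mean", rng=None):
    """Mask whole phone segments in a list of feature frames.

    features:   list of frames, each a list of floats (T frames x D dims).
    segments:   list of (start, end) frame indices per phone, end exclusive.
                Assumed to come from a forced alignment (hypothetical input).
    mask_ratio: fraction of phone segments to mask (illustrative default).
    fill:       "zero" or "mean" (utterance-level mean per dimension);
                both are assumed strategies, not taken from the paper.
    Returns (masked_features, sorted indices of the masked segments).
    """
    if rng is None:
        rng = random.Random()

    # Work on a copy so the caller's features are left untouched.
    masked = [list(frame) for frame in features]

    n_mask = min(len(segments), round(mask_ratio * len(segments)))
    if n_mask == 0 or not features:
        return masked, []

    # Pick distinct phone segments uniformly at random.
    chosen = sorted(rng.sample(range(len(segments)), n_mask))

    dims = len(features[0])
    if fill == "zero":
        fill_frame = [0.0] * dims
    elif fill == "mean":
        fill_frame = [
            sum(frame[d] for frame in features) / len(features)
            for d in range(dims)
        ]
    else:
        raise ValueError("fill must be 'zero' or 'mean'")

    # Overwrite every frame of each chosen phone with the fill frame.
    for idx in chosen:
        start, end = segments[idx]
        for t in range(start, end):
            masked[t] = list(fill_frame)

    return masked, chosen
```

A call such as `phone_mask(features, segments, mask_ratio=0.5, fill="zero", rng=random.Random(0))` masks half of the aligned phones and leaves the remaining frames unchanged. Masking at the phone level, not over arbitrary time spans, is what ties the augmentation to the reduction phenomenon the abstract describes.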
