Improved Mispronunciation detection system using a hybrid CTC-ATT based approach for L2 English speakers

Paper Title

Improved Mispronunciation detection system using a hybrid CTC-ATT based approach for L2 English speakers

Authors

Baranwal, Neha; Chilaka, Sharatkumar

Abstract

This report presents state-of-the-art research in the field of Computer Assisted Language Learning (CALL). Mispronunciation detection is one of the core components of Computer Assisted Pronunciation Training (CAPT) systems, which are a subset of CALL. Studies on automated pronunciation error detection began in the 1990s, but the development of full-fledged CAPT systems has accelerated only in the last decade, owing to increased computing power and the availability of mobile devices for recording the speech required for pronunciation analysis. Detecting pronunciation errors is a hard problem because there is no formal definition of correct and incorrect pronunciation. As a result, systems typically detect prosodic errors and phoneme errors such as phoneme substitution, insertion, and deletion. It is also generally agreed that pronunciation learning should focus on speaker intelligibility rather than on sounding like an L1 English speaker. Initially, methods were based on a posterior likelihood measure called Goodness of Pronunciation (GOP), computed using Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) and Deep Neural Network-Hidden Markov Model (DNN-HMM) approaches. These systems are complex to implement compared with the recently proposed ASR-based End-to-End mispronunciation detection systems. The purpose of this research is to create End-to-End (E2E) models using Connectionist Temporal Classification (CTC) and an attention-based sequence decoder. Recently, E2E models have shown considerable improvement in mispronunciation detection accuracy. This research will compare the baseline models CNN-RNN-CTC, CNN-RNN-CTC with a character sequence-based attention decoder, and CNN-RNN-CTC with a phoneme-based decoder. The study will help in deciding on a better approach to developing an efficient mispronunciation detection system.
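
The abstract names three phoneme error types: substitution, insertion, and deletion. As a rough illustration only, and not the authors' implementation, the sketch below shows one common way such errors can be labeled once a recognizer has produced a phoneme sequence: align the recognized phonemes against the canonical pronunciation with a Levenshtein (edit-distance) alignment. The function name `align_phonemes` and the ARPAbet-style symbols are hypothetical, chosen for this example.

```python
from typing import List, Tuple


def align_phonemes(canonical: List[str], recognized: List[str]) -> List[Tuple[str, str, str]]:
    """Label each aligned position as correct, substitution, deletion, or insertion.

    Hypothetical helper for illustration; it assumes the recognized sequence
    comes from some phoneme recognizer (for example an E2E model's output).
    """
    n, m = len(canonical), len(recognized)
    # dp[i][j] = minimum number of edits turning canonical[:i] into recognized[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == recognized[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j - 1] + cost,  # match or substitution
                dp[i - 1][j] + 1,         # canonical phoneme was dropped
                dp[i][j - 1] + 1,         # extra phoneme was spoken
            )

    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if canonical[i - 1] == recognized[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                label = "correct" if cost == 0 else "substitution"
                ops.append((label, canonical[i - 1], recognized[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("deletion", canonical[i - 1], ""))
            i -= 1
        else:
            ops.append(("insertion", "", recognized[j - 1]))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    # Illustrative input: a learner says "sink" where "think" was expected.
    for op in align_phonemes(["TH", "IH", "NG", "K"], ["S", "IH", "NG", "K"]):
        print(op)
```

When several alignments have the same edit cost, this sketch returns only one of them, so the error labels for heavily distorted utterances should be read as one plausible interpretation.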
