Paper Title
An Effective End-to-End Modeling Approach for Mispronunciation Detection
Authors
Abstract
Recently, end-to-end (E2E) automatic speech recognition (ASR) systems have garnered tremendous attention because of their great success and unified modeling paradigm in comparison to conventional hybrid DNN-HMM ASR systems. Despite the widespread adoption of E2E modeling frameworks in ASR, there is still little work investigating E2E frameworks for computer-assisted pronunciation learning (CAPT), particularly for mispronunciation detection (MD). In response, we first present a novel use of the hybrid CTC-Attention approach for the MD task, which combines the strengths of CTC and the attention-based model while circumventing the need for phone-level forced alignment. Second, we perform input augmentation with text prompt information to tailor the resulting E2E model to the MD task. In addition, we adopt two MD decision methods that fit the proposed framework: 1) decision-making based on a recognition confidence measure, or 2) decision-making based solely on the speech recognition results. A series of Mandarin MD experiments demonstrate that our approach not only simplifies the processing pipeline of existing hybrid DNN-HMM systems but also brings about systematic and substantial performance improvements. Furthermore, input augmentation with text prompts seems to hold excellent promise for the E2E-based MD approach.
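The two decision methods named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the phone labels, the 0.5 threshold, and the use of Python's `difflib` to align the recognized phones with the text prompt are all assumptions made for illustration. The abstract does not say how the recognition output is compared with the prompt or how the confidence measure is computed.

```python
from difflib import SequenceMatcher


def detect_by_recognition(canonical, recognized):
    """Recognition-result decision (hypothetical helper).

    A canonical phone from the text prompt is flagged as mispronounced
    when the alignment finds no matching phone in the recognizer output.
    """
    flags = [True] * len(canonical)
    # difflib is a stand-in aligner; the source does not specify one.
    matcher = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            flags[i] = False
    return flags


def detect_by_confidence(confidences, threshold=0.5):
    """Confidence-based decision (hypothetical helper).

    A phone is flagged when its recognition confidence falls below the
    threshold. The value 0.5 is an arbitrary placeholder.
    """
    return [score < threshold for score in confidences]


# Illustrative labels only: the second phone is produced with the wrong tone.
canonical = ["n", "i3", "h", "ao3"]
recognized = ["n", "i2", "h", "ao3"]
recognition_flags = detect_by_recognition(canonical, recognized)
confidence_flags = detect_by_confidence([0.9, 0.2, 0.8, 0.95])
```

In this toy case both methods flag only the second phone. In practice the confidence threshold would have to be tuned on held-out data, which the sketch does not attempt.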