Context-aware Goodness of Pronunciation for Computer-Assisted Pronunciation Training

Paper title

Context-aware Goodness of Pronunciation for Computer-Assisted Pronunciation Training

Authors

Shi, Jiatong; Huo, Nan; Jin, Qin

Abstract

Mispronunciation detection is an essential component of Computer-Assisted Pronunciation Training (CAPT) systems. State-of-the-art mispronunciation detection models use Deep Neural Networks (DNN) for acoustic modeling and a Goodness of Pronunciation (GOP) based algorithm for pronunciation scoring. However, GOP-based scoring models have two major limitations: (i) they depend on forced alignment, which splits the speech into phonetic segments that are then scored independently, neglecting the transitions between phonemes within a segment; (ii) they focus only on phonetic segments and so fail to consider context effects across phonemes (such as liaison, omission, and incomplete plosive sounds). In this work, we propose the Context-aware Goodness of Pronunciation (CaGOP) scoring model. In particular, two factors, the transition factor and the duration factor, are injected into CaGOP scoring. The transition factor identifies the transitions between phonemes and uses them to weight the frame-wise GOP. Moreover, self-attention based phonetic duration modeling is proposed to introduce the duration factor into the scoring model. The proposed scoring model significantly outperforms baselines, achieving 20% and 12% relative improvement over the GOP model on phoneme-level and sentence-level mispronunciation detection, respectively.
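The idea of weighting a frame-wise GOP can be illustrated with a small sketch. This is not the paper's CaGOP implementation: it assumes one common DNN-based formulation in which a segment's score is the average log posterior of the canonical phone over its frames, and the per-frame `weights` are hypothetical stand-ins for a transition factor (here simply hand-picked numbers). The function names and the toy posteriors are likewise illustrative assumptions.

```python
import math


def frame_scores(posteriors, phone):
    """Log posterior of the canonical phone at each frame of a segment.

    posteriors: list of dicts mapping phone label -> posterior probability,
                one dict per frame (hypothetical acoustic-model output).
    """
    return [math.log(frame[phone]) for frame in posteriors]


def segment_gop(posteriors, phone, weights=None):
    """Weighted average of frame-wise scores over one aligned segment.

    With weights=None every frame counts equally (plain averaging).
    Non-uniform weights are an assumed stand-in for a transition factor
    that down-weights frames near phoneme boundaries.
    """
    scores = frame_scores(posteriors, phone)
    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("need exactly one weight per frame")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total


# Toy segment aligned to the phone "ae": the first frame sits on a
# transition and is acoustically ambiguous, the other two are clean.
segment = [
    {"ae": 0.5, "eh": 0.5},
    {"ae": 0.9, "eh": 0.1},
    {"ae": 0.9, "eh": 0.1},
]

unweighted = segment_gop(segment, "ae")
weighted = segment_gop(segment, "ae", weights=[0.2, 1.0, 1.0])
```

In this toy case the weighted score is higher than the unweighted one, because the ambiguous boundary frame contributes less. How the weights are actually derived from detected transitions, and how the duration factor enters the score, is specified in the paper rather than here.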
