Paper Title

Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning

Paper Authors

Eesung Kim, Jae-Jin Jeon, Hyeji Seo, Hoon Kim

Paper Abstract

Self-supervised learning (SSL) approaches such as wav2vec 2.0 and HuBERT have shown promising results in various downstream tasks in the speech community. In particular, speech representations learned by SSL models have been shown to be effective for encoding various speech-related characteristics. In this context, we propose a novel automatic pronunciation assessment method based on SSL models. First, the proposed method fine-tunes the pre-trained SSL models with connectionist temporal classification to adapt to the English pronunciation of English-as-a-second-language (ESL) learners in a data environment. Then, layer-wise contextual representations are extracted from all transformer layers of the SSL models. Finally, the automatic pronunciation score is estimated using a bidirectional long short-term memory network with the layer-wise contextual representations and the corresponding text. We show that the proposed SSL model-based methods outperform the baselines, in terms of the Pearson correlation coefficient, on a dataset of Korean ESL learner children and on Speechocean762. Furthermore, we analyze how the representations of different transformer layers in the SSL model affect the performance of the pronunciation assessment task.
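As a rough illustration of the pipeline the abstract describes (not the authors' implementation), the sketch below extracts hidden states from every transformer layer of a pre-trained wav2vec 2.0 model, combines them with a learnable layer weighting, fuses them with a simple embedding of the reference text, and regresses an utterance-level score with a BiLSTM. The checkpoint name, the layer-weighting scheme, the text encoder, and the fusion by concatenation are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of an SSL-based pronunciation scorer (assumptions noted above).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SSLPronunciationScorer(nn.Module):
    def __init__(self, ssl_name="facebook/wav2vec2-base-960h",
                 text_vocab_size=32, text_dim=64, hidden_dim=256):
        super().__init__()
        # Pre-trained SSL encoder; hidden states of every transformer layer are kept.
        self.ssl = Wav2Vec2Model.from_pretrained(ssl_name)
        num_layers = self.ssl.config.num_hidden_layers + 1  # + the feature-projection output
        # Learnable scalar weights over layers (one common way to use layer-wise features).
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        ssl_dim = self.ssl.config.hidden_size
        # Simple text embedding as a placeholder for the paper's text input.
        self.text_embed = nn.Embedding(text_vocab_size, text_dim)
        self.text_proj = nn.Linear(text_dim, ssl_dim)
        # BiLSTM over the fused sequence, then mean pooling and a linear regression head.
        self.blstm = nn.LSTM(ssl_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, waveform, text_ids):
        # waveform: (batch, samples); text_ids: (batch, text_len)
        out = self.ssl(waveform, output_hidden_states=True)
        # Stack all layer outputs: (num_layers, batch, frames, dim), then weighted sum.
        hidden = torch.stack(out.hidden_states, dim=0)
        weights = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        speech_feat = (weights * hidden).sum(dim=0)               # (batch, frames, dim)
        text_feat = self.text_proj(self.text_embed(text_ids))     # (batch, text_len, dim)
        # Concatenate along time so the BiLSTM sees both modalities (one simple fusion choice).
        seq = torch.cat([speech_feat, text_feat], dim=1)
        enc, _ = self.blstm(seq)
        return self.head(enc.mean(dim=1)).squeeze(-1)             # utterance-level score

# Usage: dummy batch of 1-second audio at 16 kHz and a short phone-ID sequence.
model = SSLPronunciationScorer()
scores = model(torch.randn(2, 16000), torch.randint(0, 32, (2, 10)))
print(scores.shape)  # torch.Size([2])
```

At evaluation time, the predicted utterance scores would be compared against human ratings with the Pearson correlation coefficient (e.g., scipy.stats.pearsonr), which is the metric reported in the abstract.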
