Paper Title
Transfer Learning from Monolingual ASR to Transcription-free Cross-lingual Voice Conversion
Paper Authors
Paper Abstract
Cross-lingual voice conversion (VC) is a task that aims to synthesize target voices with the same content while source and target speakers speak in different languages. Its challenge lies in the fact that the source and target data are naturally non-parallel, and it is even difficult to bridge the gap between languages when no transcriptions are provided. In this paper, we focus on knowledge transfer from monolingual ASR to cross-lingual VC in order to address the content mismatch problem. To achieve this, we first train a monolingual acoustic model for the source language, use it to extract phonetic features for all the speech in the VC dataset, and then train a Seq2Seq conversion model to predict the mel-spectrograms. We successfully address cross-lingual VC without any transcription or language-specific knowledge for foreign speech. We conduct experiments on the Voice Conversion Challenge 2020 dataset and show that our speaker-dependent conversion model outperforms the zero-shot baseline, achieving MOS of 3.83 and 3.54 in speech quality and speaker similarity for cross-lingual conversion. Compared to the cascade ASR-TTS method, our proposed method significantly reduces the MOS drop between intra- and cross-lingual conversion.
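The two-stage pipeline described in the abstract (phonetic feature extraction with a pretrained monolingual ASR acoustic model, then Seq2Seq conversion to mel-spectrograms) can be summarized in a minimal PyTorch sketch. All module names, layer choices, and dimensions below (PHONE_DIM, MEL_DIM, the LSTM encoders) are illustrative assumptions, not the paper's actual architecture; attention and autoregressive decoding, which a full Seq2Seq converter would use, are omitted for brevity.

```python
# Minimal sketch of the ASR-to-VC transfer pipeline, assuming PyTorch.
import torch
import torch.nn as nn

PHONE_DIM = 256   # assumed dimensionality of the phonetic features
MEL_DIM = 80      # assumed number of mel-spectrogram bins


class MonolingualASR(nn.Module):
    """Stand-in for an ASR acoustic model trained on the source language."""

    def __init__(self, n_spec_bins=MEL_DIM, n_phones=70):
        super().__init__()
        self.encoder = nn.LSTM(n_spec_bins, PHONE_DIM, batch_first=True)
        self.classifier = nn.Linear(PHONE_DIM, n_phones)  # phone posteriors

    def extract_phonetic_features(self, spec):
        # Use the hidden layer below the phone classifier as a
        # PPG-like, language-agnostic phonetic representation.
        feats, _ = self.encoder(spec)
        return feats  # (batch, time, PHONE_DIM)


class Seq2SeqConverter(nn.Module):
    """Simplified converter: encode phonetic features, decode mels."""

    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(PHONE_DIM, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.mel_out = nn.Linear(hidden, MEL_DIM)

    def forward(self, phonetic_feats):
        enc, _ = self.encoder(phonetic_feats)
        dec, _ = self.decoder(enc)
        return self.mel_out(dec)  # predicted mel-spectrogram


# Usage: freeze the pretrained ASR model, extract features for all VC
# speech (source-language and foreign alike), and train the converter.
asr = MonolingualASR().eval()       # pretrained on the source language
converter = Seq2SeqConverter()
spec = torch.randn(1, 120, MEL_DIM)  # dummy 120-frame input spectrogram
with torch.no_grad():
    feats = asr.extract_phonetic_features(spec)
mel_pred = converter(feats)
print(mel_pred.shape)  # torch.Size([1, 120, 80])
```

Because the extracted phonetic features require no transcriptions, the same frozen extractor can be applied to foreign-language speech, which is what lets the converter be trained without any language-specific knowledge for the target side.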