Paper Title
Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus
Paper Authors
Paper Abstract
Training a text-to-speech (TTS) model requires a large-scale, text-labeled speech corpus, which is troublesome to collect. In this paper, we propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech data for pre-training. By leveraging wav2vec2.0 representations, unlabeled speech can significantly improve performance, especially when labeled speech is scarce. We also extend the proposed method to zero-shot multi-speaker TTS (ZS-TTS). The experimental results verify the effectiveness of the proposed method in terms of naturalness, intelligibility, and speaker generalization. We highlight that a single-speaker TTS model fine-tuned on only 10 minutes of labeled data outperforms the other baselines, and that a ZS-TTS model fine-tuned on only 30 minutes of single-speaker labeled data can generate the voice of an arbitrary speaker after pre-training on an unlabeled multi-speaker speech corpus.
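The key ingredient the abstract names is the use of wav2vec2.0 representations of unlabeled speech during pre-training. As a minimal sketch of the feature-extraction step only (the paper's actual pre-training objective and architecture are not shown here), the snippet below pulls frame-level wav2vec 2.0 representations from raw audio using the Hugging Face transformers library; the facebook/wav2vec2-base checkpoint and the choice of the final hidden layer are assumptions for illustration, not details taken from the paper.

# Minimal sketch: extracting wav2vec 2.0 representations from unlabeled audio.
# Assumptions (not from the paper): the "facebook/wav2vec2-base" checkpoint
# and the use of the final hidden layer as the representation.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# Stand-in for one unlabeled utterance: 2 seconds of 16 kHz audio.
waveform = torch.randn(32000)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level representations, shape (1, num_frames, 768) for the base model.
# Features like these could serve as intermediate targets when pre-training a
# TTS acoustic model on speech that has no text labels.
representations = outputs.last_hidden_state
print(representations.shape)

Because such representations are obtained without any transcripts, a TTS model can first learn to predict them from speech alone on a large multi-speaker corpus, and only the small labeled set is then needed to connect text to the learned representation space.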