Paper Title

Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement

Authors

Dongyang Dai, Li Chen, Yuping Wang, Mu Wang, Rui Xia, Xuchen Song, Zhiyong Wu, Yuxuan Wang

Abstract

With the popularity of deep neural networks, speech synthesis has improved significantly in recent years on the basis of the end-to-end encoder-decoder framework. More and more applications that rely on speech synthesis technology are widely used in daily life. A robust speech synthesis model depends on high-quality, customized data, which takes considerable effort to collect. It is therefore worth investigating how to use low-quality, low-resource voice data, which can be easily obtained from the Internet, to synthesize personalized voices. In this paper, the proposed end-to-end speech synthesis model uses both a speaker embedding and a noise representation as conditional inputs to model speaker and noise information, respectively. First, the speech synthesis model is pre-trained on both multi-speaker clean data and noisy augmented data; then the pre-trained model is adapted on noisy, low-resource data from a new speaker; finally, by setting the clean speech condition, the model can synthesize the new speaker's clean voice. Experimental results show that speech generated by the proposed approach achieves better subjective evaluation results than directly fine-tuning a pre-trained multi-speaker speech synthesis model on denoised new-speaker data.
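
The conditioning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper `condition_frames`, the one-hot `CLEAN`/`NOISY` codes, and the choice of concatenation are all assumptions made for illustration, and the paper's noise representation may well be learned rather than fixed. The sketch only shows how one speaker embedding can be paired with a noisy condition during adaptation and with a clean condition at synthesis time.

```python
from typing import List

# Hypothetical noise-condition codes (an assumption for illustration);
# the paper's actual noise representation may be a learned vector.
CLEAN: List[float] = [1.0, 0.0]
NOISY: List[float] = [0.0, 1.0]


def condition_frames(
    encoder_frames: List[List[float]],
    speaker_embedding: List[float],
    noise_condition: List[float],
) -> List[List[float]]:
    """Append the speaker and noise condition vectors to every encoder frame.

    Concatenation is one common way to condition a decoder; it is assumed
    here, not taken from the paper.
    """
    return [frame + speaker_embedding + noise_condition for frame in encoder_frames]


# Toy values standing in for text-encoder outputs and a new speaker's embedding.
frames = [[0.1, 0.2], [0.3, 0.4]]
new_speaker = [0.5, 0.5, 0.5]

# Adaptation stage: the new speaker's recordings are noisy, so the model
# sees the noisy condition alongside the speaker embedding.
adaptation_input = condition_frames(frames, new_speaker, NOISY)

# Synthesis stage: same speaker embedding, but the clean condition is set,
# asking the model for the new speaker's voice without the noise.
synthesis_input = condition_frames(frames, new_speaker, CLEAN)
```

Keeping speaker and noise information in separate inputs is what lets the two be recombined freely: only the last two values of each conditioned frame differ between the adaptation and synthesis inputs.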
