Paper title
Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit
Authors
Abstract
Recent neural speech synthesis systems have increasingly focused on prosody control to improve the quality of synthesized speech, but they rarely consider the variability of prosody together with the correlation between prosody and semantics. In this paper, a prosody learning mechanism is proposed to model the prosody of speech in a TTS system: a prosody learner extracts prosody information from the mel-spectrum and combines it with the phoneme sequence to reconstruct the mel-spectrum. Meanwhile, semantic features of the text from a pre-trained language model are introduced to improve the prosody prediction results. In addition, a novel self-attention structure, named local attention, is proposed to lift the restriction on input text length; the relative position information of the sequence is modeled by relative position matrices, so that position encodings are no longer needed. Experiments on English and Mandarin show that our model produces speech with more satisfactory prosody. In Mandarin synthesis in particular, the proposed model outperforms the baseline model by a MOS gap of 0.08, and the overall naturalness of the synthesized speech is significantly improved.
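
The abstract does not give the formulation of local attention, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each position attends to neighbors within a fixed window and that position information enters solely through a bias indexed by the relative offset. The function name `local_attention`, the scalar per-offset bias `rel_bias`, and the use of the input vectors directly as queries, keys, and values (no learned projections) are all illustrative assumptions.

```python
import math


def local_attention(x, rel_bias, window):
    """Windowed self-attention with a relative-position bias (illustrative sketch).

    x        : list of T vectors, each a list of d floats; used here as
               queries, keys, and values alike (assumption: no projections).
    rel_bias : list of 2 * window + 1 floats, one bias per relative offset
               from -window to +window (assumption: scalar bias per offset).
    window   : number of neighbors visible on each side of a position.
    """
    T = len(x)
    d = len(x[0])
    out = []
    for i in range(T):
        # Only positions inside the local window are visible to position i.
        lo = max(0, i - window)
        hi = min(T, i + window + 1)
        scores = []
        for j in range(lo, hi):
            dot = sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            # The bias depends on the offset j - i, never on absolute position.
            scores.append(dot + rel_bias[j - i + window])
        # Numerically stable softmax over the window.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([
            sum(w * x[j][k] for w, j in zip(weights, range(lo, hi)))
            for k in range(d)
        ])
    return out


# The same 2 * window + 1 bias values serve sequences of any length.
bias = [0.0, 0.1, 0.5, 0.1, 0.0]
short = [[float(t), 1.0] for t in range(5)]
long_seq = [[float(t % 7), 1.0] for t in range(200)]
short_out = local_attention(short, bias, window=2)
long_out = local_attention(long_seq, bias, window=2)
```

Because every parameter in this sketch is indexed by relative offset, nothing ties it to a maximum sequence length, which is one plausible reading of how such a structure removes the need for absolute position encodings.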