Paper Title


Neural text-to-speech with a modeling-by-generation excitation vocoder

Authors

Eunwoo Song, Min-Jae Hwang, Ryuichi Yamamoto, Jin-Seob Kim, Ohsung Kwon, Jae-Min Kim

Abstract


This paper proposes a modeling-by-generation (MbG) excitation vocoder for a neural text-to-speech (TTS) system. Recently proposed neural excitation vocoders can realize high-quality waveform generation by combining a vocal tract filter with a WaveNet-based glottal excitation generator. However, when these vocoders are used in a TTS system, the quality of the synthesized speech is often degraded owing to a mismatch between the training and synthesis steps. Specifically, the vocoder is trained separately from the acoustic model front-end, so the acoustic model's estimation errors are inevitably boosted throughout the synthesis process of the vocoder back-end. To address this problem, we propose incorporating an MbG structure into the vocoder's training process. In the proposed method, the excitation signal is extracted using the spectral parameters generated by the acoustic model, and the neural vocoder is then optimized not only to learn the target excitation's distribution but also to compensate for the estimation errors introduced by the acoustic model. Furthermore, because the generated spectral parameters are shared between the training and synthesis steps, the mismatch between them is effectively reduced. Experimental results verify that the proposed system provides high-quality synthetic speech, achieving a mean opinion score of 4.57 within the TTS framework.
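The core MbG idea, extracting the target excitation with the acoustic model's *generated* spectral parameters rather than the ground-truth ones, can be illustrated with a toy LPC inverse-filtering sketch. This is a minimal sketch, not the paper's implementation: the real system uses a WaveNet-based excitation generator and LSF-derived vocal tract filters, while the two-tap filter, sample values, and function names below are hypothetical.

```python
def inverse_filter(speech, lpc):
    """LPC analysis (inverse) filtering: e[n] = s[n] - sum_k a[k] * s[n-1-k].
    In the MbG setup, `lpc` comes from the acoustic model's *generated*
    spectral parameters, not from ground-truth analysis."""
    out = []
    for n in range(len(speech)):
        pred = sum(a * speech[n - 1 - k]
                   for k, a in enumerate(lpc) if n - 1 - k >= 0)
        out.append(speech[n] - pred)
    return out


def synthesis_filter(excitation, lpc):
    """LPC synthesis filtering: s[n] = e[n] + sum_k a[k] * s[n-1-k]."""
    out = []
    for n in range(len(excitation)):
        pred = sum(a * out[n - 1 - k]
                   for k, a in enumerate(lpc) if n - 1 - k >= 0)
        out.append(excitation[n] + pred)
    return out


# Hypothetical speech frame and "generated" LPC coefficients
# (in the real system the latter come from the acoustic model).
speech = [0.0, 1.0, 0.5, -0.3, 0.2, 0.1]
generated_lpc = [0.9, -0.2]

# MbG training target: excitation extracted with the generated filter,
# so the vocoder also learns to absorb the acoustic model's errors.
target_excitation = inverse_filter(speech, generated_lpc)

# At synthesis time the same generated filter is applied, so a perfectly
# modeled excitation reconstructs the original speech exactly.
reconstructed = synthesis_filter(target_excitation, generated_lpc)
```

Because the same generated filter is used for both excitation extraction and waveform synthesis, the analysis/synthesis round trip is an identity, which is why sharing the generated parameters across training and synthesis removes the mismatch.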
