A Cyclical Post-filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-speech Systems

Paper Title

A Cyclical Post-filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-speech Systems

Authors

Wu, Yi-Chiao; Tobing, Patrick Lumban; Yasuhara, Kazuki; Matsunaga, Noriyuki; Ohtani, Yamato; Toda, Tomoki

Abstract

Recently, text-to-speech (TTS) systems combined with neural vocoders have been shown to generate high-fidelity speech. However, collecting the required training data and building these advanced systems from scratch is time- and resource-consuming. An economical alternative is to develop a neural vocoder that enhances the speech generated by existing or low-cost TTS systems. Nonetheless, this approach usually suffers from two issues: 1) temporal mismatches between TTS and natural waveforms, and 2) acoustic mismatches between training and testing data. To address these issues, we adopt a cyclic voice conversion (VC) model to generate temporally matched pseudo-VC data for training the neural vocoder and acoustically matched enhanced data for testing it. Because of its generality, this framework can be applied to arbitrary TTS systems and neural vocoders. In this paper, we apply the proposed method with a state-of-the-art WaveNet vocoder to two different basic TTS systems, and both objective and subjective experimental results confirm the effectiveness of the proposed framework.
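The two mismatch fixes described in the abstract can be illustrated with a toy data-flow sketch. This is one reading of the abstract and not the authors' implementation. The names `to_tts_domain` and `to_natural_domain` are hypothetical placeholders for the two conversion directions of a trained cyclic VC model, `HOP` is an assumed frame hop, and the vocoder itself is omitted.

The sketch makes a structural point in two parts:
- Conversion is applied frame by frame, so the pseudo-VC training features keep one frame per natural frame (temporal match).
- Training inputs and test inputs both leave through the same natural-direction converter (acoustic match).

```python
from typing import Callable, List, Tuple

Frame = List[float]     # one acoustic feature vector for a single analysis frame
Features = List[Frame]  # one Frame per analysis frame of an utterance
HOP = 4                 # hypothetical waveform samples per frame, tiny for the demo


def frame_wise(convert: Callable[[Frame], Frame]) -> Callable[[Features], Features]:
    """Apply a per-frame mapping to every frame; the frame count never changes."""
    def apply(feats: Features) -> Features:
        return [convert(frame) for frame in feats]
    return apply


# Hypothetical placeholders for the two directions of a trained cyclic VC model.
# Real converters are learned models; these affine maps only stand in for them.
def to_tts_domain(frame: Frame) -> Frame:
    return [0.8 * x + 0.1 for x in frame]


def to_natural_domain(frame: Frame) -> Frame:
    return [1.2 * x - 0.1 for x in frame]


def make_training_pair(
    natural_feats: Features, natural_wave: List[float]
) -> Tuple[Features, List[float]]:
    """Training: natural -> TTS-like -> natural (the cyclic path).

    The pseudo-VC features keep one frame per natural frame, so they stay
    temporally aligned with the natural waveform used as the vocoder target.
    """
    assert len(natural_wave) == len(natural_feats) * HOP
    tts_like = frame_wise(to_tts_domain)(natural_feats)
    pseudo_vc = frame_wise(to_natural_domain)(tts_like)
    return pseudo_vc, natural_wave


def enhance_tts(tts_feats: Features) -> Features:
    """Testing: TTS features pass through the same natural-direction converter,
    so the vocoder sees inputs produced by the mapping it was trained behind."""
    return frame_wise(to_natural_domain)(tts_feats)


if __name__ == "__main__":
    natural = [[0.1 * t, 0.2, -0.3] for t in range(5)]
    wave = [0.0] * (len(natural) * HOP)
    pseudo, target = make_training_pair(natural, wave)
    enhanced = enhance_tts([[0.5, 0.5, 0.5]] * 7)
    print(len(pseudo), len(target), len(enhanced))  # 5 20 7
```

Running it prints `5 20 7`. That is five aligned pseudo-VC frames, twenty target samples, and seven enhanced frames. In this sketch, alignment comes for free because the converters work frame-wise. A converter that changed the frame count would bring the temporal mismatch back.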
