Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech

Paper title

Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech

Authors

Bae, Jae-Sung; Yang, Jinhyeok; Bak, Tae-Jun; Joo, Young-Sun

Abstract

This paper proposes HiMuV-TTS, a non-autoregressive text-to-speech model based on a hierarchical and multi-scale variational autoencoder, to generate natural speech with diverse speaking styles. Recent advances in non-autoregressive TTS (NAR-TTS) models have significantly improved inference speed and the robustness of synthesized speech. However, the diversity of speaking styles and the naturalness of the speech still need improvement. To address this, the proposed HiMuV-TTS model first determines the global-scale prosody and then determines the local-scale prosody by conditioning on the global-scale prosody and the learned text representation. In addition, speech quality is improved by adopting an adversarial training technique. Experimental results verify that the proposed HiMuV-TTS model generates more diverse and natural speech than TTS models with single-scale variational autoencoders, and that it represents different prosody information at each scale.
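The two-step ordering described in the abstract (global-scale prosody first, then local-scale prosody conditioned on it and on the text representation) can be illustrated with a toy sampling sketch. This is not the authors' implementation: the function names, the standard normal prior, the latent sizes, and the way the local mean is formed are all assumptions chosen only to show the dependency order.

```python
import random


def sample_global_prosody(dim, rng):
    """Draw one utterance-level latent.

    Assumption: a standard normal prior; the abstract does not specify the prior.
    """
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def sample_local_prosody(z_global, text_repr, rng, noise_scale=0.1):
    """Draw one latent per text position, conditioned on z_global and the text.

    Assumption: the conditioning is a placeholder. Each local mean is the text
    feature plus the average of the global latent; a real model would use a
    learned network here.
    """
    global_shift = sum(z_global) / len(z_global)
    return [
        [feature + global_shift + noise_scale * rng.gauss(0.0, 1.0)
         for feature in position]
        for position in text_repr
    ]


rng = random.Random(0)
# Hypothetical learned text representation: 3 positions x 2 features.
text_repr = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

z_global = sample_global_prosody(4, rng)                   # step 1: global scale
z_local = sample_local_prosody(z_global, text_repr, rng)   # step 2: local scale
```

Resampling `z_global` while keeping `text_repr` fixed shifts every local latent together, which is the kind of utterance-wide style variation a global-scale latent is meant to capture in this toy setting.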
