VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

Paper title

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

Authors

Du, Chenpeng; Guo, Yiwei; Chen, Xie; Yu, Kai

Abstract

The mainstream neural text-to-speech (TTS) pipeline is a cascade system consisting of an acoustic model (AM) that predicts acoustic features from the input transcript and a vocoder that generates a waveform from those features. However, the acoustic feature in current TTS systems is typically the mel-spectrogram, which is highly correlated along both the time and frequency axes in a complicated way, making it difficult for the AM to predict. Although recent neural vocoders can generate high-fidelity audio from ground-truth (GT) mel-spectrograms, the gap between the GT mel-spectrogram and the one predicted by the AM degrades the performance of the entire TTS system. In this work, we propose VQTTS, consisting of an AM, txt2vec, and a vocoder, vec2wav, which uses self-supervised vector-quantized (VQ) acoustic features rather than mel-spectrograms. We redesign both the AM and the vocoder accordingly. In particular, txt2vec essentially becomes a classification model instead of a traditional regression model, while vec2wav adds a feature encoder before the HifiGAN generator to smooth the discontinuous quantized features. Our experiments show that vec2wav achieves better reconstruction performance than HifiGAN when using self-supervised VQ acoustic features. Moreover, our entire TTS system, VQTTS, achieves state-of-the-art naturalness among all currently publicly available TTS systems.
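
The contrast the abstract draws between classification and regression can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' implementation: the nearest-neighbour quantizer, the four-entry codebook, the example frame, and the logits are all assumptions made up for the example. It shows that once each acoustic frame is replaced by a codebook index, the acoustic model's target is a class label scored with cross-entropy, whereas a mel-spectrogram frame is a continuous vector usually scored with a regression loss such as mean squared error.

```python
import math

def quantize(frame, codebook):
    """Return the index of the codebook entry nearest to `frame` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))

def cross_entropy(logits, target_index):
    """Classification loss over codebook indices: negative log-softmax of the target class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def mse(predicted, target):
    """Regression loss of the kind commonly used for continuous frames, shown for contrast."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

# Hypothetical toy codebook of four 2-dimensional vectors (assumption for illustration).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame = [0.9, 0.2]                  # a made-up continuous acoustic frame
target = quantize(frame, codebook)  # the frame becomes a discrete class label
logits = [0.1, 2.0, -1.0, 0.3]      # made-up model scores, one per codebook entry
loss = cross_entropy(logits, target)
```

With discrete targets, the model only has to pick one of a finite set of codebook entries per frame rather than match every time-frequency bin of a continuous spectrogram.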
