Paper Title
WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU
Paper Authors
Paper Abstract
In this paper, we propose WG-WaveNet, a fast, lightweight, and high-quality waveform generation model. WG-WaveNet is composed of a compact flow-based model and a post-filter. The two components are jointly trained by maximizing the likelihood of the training data and optimizing loss functions defined on the frequency domain. Because the flow-based model is heavily compressed, the proposed model requires far fewer computational resources than other waveform generation models during both training and inference; even though the model is highly compressed, the post-filter maintains the quality of the generated waveforms. Our PyTorch implementation can be trained using less than 8 GB of GPU memory and generates audio samples at a rate of more than 960 kHz on an NVIDIA 1080Ti GPU. Furthermore, even when synthesizing on a CPU, the proposed method generates 44.1 kHz speech waveforms 1.2 times faster than real-time. Experiments also show that the quality of the generated audio is comparable to that of other methods. Audio samples are publicly available online.
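The abstract mentions optimizing loss functions defined on the frequency domain alongside the likelihood objective. A common way to realize such a loss for waveform models is a multi-resolution STFT loss, combining a spectral-convergence term and a log-magnitude term at several FFT resolutions. The sketch below is a minimal NumPy illustration of that idea under this assumption; the function names, window choice, and resolution settings are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def stft_mag(x, fft_size, hop):
    """Magnitude spectrogram of a 1-D signal via framed, Hann-windowed FFT."""
    n_frames = 1 + (len(x) - fft_size) // hop
    window = np.hanning(fft_size)
    frames = np.stack(
        [x[i * hop : i * hop + fft_size] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=-1))

def spectral_losses(y, y_hat, fft_size, hop, eps=1e-7):
    """Spectral convergence and log-magnitude L1 distance at one resolution."""
    S, S_hat = stft_mag(y, fft_size, hop), stft_mag(y_hat, fft_size, hop)
    # Spectral convergence: relative Frobenius-norm error of the magnitudes.
    sc = np.linalg.norm(S - S_hat) / (np.linalg.norm(S) + eps)
    # L1 distance between log magnitudes (eps avoids log(0)).
    mag = np.mean(np.abs(np.log(S + eps) - np.log(S_hat + eps)))
    return sc, mag

def multi_resolution_stft_loss(y, y_hat,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average the two spectral terms over several (fft_size, hop) settings."""
    total = 0.0
    for fft_size, hop in resolutions:
        sc, mag = spectral_losses(y, y_hat, fft_size, hop)
        total += sc + mag
    return total / len(resolutions)
```

In a training setup, a term like this would be added to the negative log-likelihood of the flow-based model; evaluating the magnitudes at multiple resolutions trades off time and frequency localization so that no single STFT configuration dominates the objective.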