Paper Title
ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech
Paper Authors
Paper Abstract
Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of the inherited iterative sampling process hinders their application to text-to-speech deployment. Through a preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work that estimates the gradient of the data density, ProDiff parameterizes the denoising model by directly predicting the clean data, which avoids the distinct quality degradation that arises when accelerating sampling. To tackle the model convergence challenge with decreased diffusion iterations, ProDiff reduces the data variance in the target site via knowledge distillation. Specifically, the denoising model uses the mel-spectrogram generated by an N-step DDIM teacher as the training target and distills the behavior into a new model with N/2 steps. As such, it allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms, while maintaining sample quality and diversity competitive with state-of-the-art models that use hundreds of steps. ProDiff enables sampling 24x faster than real-time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time. Our extensive ablation studies demonstrate that each design choice in ProDiff is effective, and we further show that ProDiff can be easily extended to the multi-speaker setting. Audio samples are available at \url{https://ProDiff.github.io/}.
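To make the parameterization contrast and the distillation target described above more concrete, the following is a minimal sketch in standard DDPM notation; the symbols ($x_0$ for the clean mel-spectrogram, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ for the noised sample at step $t$, and $\epsilon_\theta$/$x_\theta$ for the denoising network) are conventional assumptions rather than the paper's exact notation:

% Sketch of the training objectives implied by the abstract (assumed notation, not the paper's own).
\begin{align*}
\mathcal{L}_{\epsilon\text{-pred}} &= \mathbb{E}_{x_0,\,\epsilon,\,t}\,\big\|\epsilon - \epsilon_\theta(x_t, t)\big\|_2^2 &&\text{(gradient-based parameterization; needs many iterations)}\\
\mathcal{L}_{x_0\text{-pred}} &= \mathbb{E}_{x_0,\,\epsilon,\,t}\,\big\|x_0 - x_\theta(x_t, t)\big\|_2^2 &&\text{(ProDiff: directly predict the clean data)}\\
\mathcal{L}_{\text{distill}} &= \mathbb{E}_{x_t,\,t}\,\big\|\hat{x}_0^{\,\text{teacher}} - x_\theta(x_t, t)\big\|_2^2 &&\text{($\hat{x}_0^{\,\text{teacher}}$: mel-spectrogram from the $N$-step DDIM teacher)}
\end{align*}

Each distillation round halves the student's step count ($N \rightarrow N/2$), which is consistent with the 2-iteration sampler reported in the abstract.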