Paper Title


Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

Authors

Weiss, Ron J., Skerry-Ryan, RJ, Battenberg, Eric, Mariooryad, Soroosh, Kingma, Diederik P.

Abstract


We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length blocks, each one containing hundreds of samples. The interdependencies of waveform samples within each block are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding blocks. This model can be optimized directly with maximum likelihood, without using intermediate, hand-designed features nor additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system, with significantly improved generation speed.
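The core idea in the abstract — blocks generated autoregressively, while samples *within* a block come from a single parallel pass through an invertible (normalizing) flow — can be illustrated with a toy sketch. Everything below is an assumption for illustration only: the flow is reduced to one affine transform, and the conditioning network (`W_shift`, `W_scale`, mapping the previous block to shift/scale parameters) is a hypothetical random linear map standing in for the paper's Tacotron-style text-conditioned decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

BLOCK = 4      # samples per block (the paper uses hundreds)
N_BLOCKS = 3   # number of blocks to synthesize

# Hypothetical conditioning network: maps the previous block to the
# shift and log-scale of an affine flow over the current block.
W_shift = rng.normal(scale=0.1, size=(BLOCK, BLOCK))
W_scale = rng.normal(scale=0.1, size=(BLOCK, BLOCK))

def flow_forward(z, prev_block):
    """Invertible affine flow z -> x, conditioned on the previous block.
    All BLOCK samples are produced in one parallel step."""
    shift = W_shift @ prev_block
    log_scale = W_scale @ prev_block
    return z * np.exp(log_scale) + shift

def flow_inverse(x, prev_block):
    """Exact inverse x -> z, which is what makes maximum-likelihood
    training tractable (plus the log-det of the Jacobian)."""
    shift = W_shift @ prev_block
    log_scale = W_scale @ prev_block
    return (x - shift) * np.exp(-log_scale)

# Synthesis: autoregressive over blocks, parallel within each block.
prev = np.zeros(BLOCK)          # initial conditioning context
waveform = []
for _ in range(N_BLOCKS):
    z = rng.standard_normal(BLOCK)   # sample from the base distribution
    x = flow_forward(z, prev)        # one parallel flow pass per block
    waveform.append(x)
    prev = x                         # condition the next block on this one

waveform = np.concatenate(waveform)

# Invertibility check: recover the base sample z for the last block
# from the generated waveform and its conditioning block.
z_rec = flow_inverse(x, waveform[-2 * BLOCK:-BLOCK])
```

The sketch shows why synthesis is fast relative to sample-level autoregression: the sequential loop runs once per block of hundreds of samples rather than once per sample, and the exact inverse lets training evaluate the likelihood of real waveform blocks in parallel.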
