Mandarin Singing Voice Synthesis with Denoising Diffusion Probabilistic Wasserstein GAN

Title

Mandarin Singing Voice Synthesis with Denoising Diffusion Probabilistic Wasserstein GAN

Authors

Cho, Yin-Ping; Tsao, Yu; Wang, Hsin-Min; Liu, Yi-Wen

Abstract

Singing voice synthesis (SVS) is the computer production of a human-like singing voice from given musical scores. To accomplish end-to-end SVS effectively and efficiently, this work adopts the acoustic model plus neural vocoder architecture established for high-quality speech and singing voice synthesis. Specifically, this work pursues a higher level of expressiveness in synthesized voices by combining the denoising diffusion probabilistic model (DDPM) and the Wasserstein generative adversarial network (WGAN) to construct the backbone of the acoustic model. On top of the proposed acoustic model, a HiFi-GAN neural vocoder is adopted with integrated fine-tuning to ensure optimal synthesis quality for the resulting end-to-end SVS system. This end-to-end system was evaluated with the multi-singer Mpop600 Mandarin singing voice dataset. In the experiments, the proposed system exhibits improvements over previous landmark counterparts in terms of musical expressiveness and high-frequency acoustic details. Moreover, the adversarial acoustic model converged stably without the need to enforce reconstruction objectives, suggesting that the proposed combined DDPM and WGAN architecture converges more stably than alternative GAN-based SVS systems.
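
For readers unfamiliar with the two building blocks named in the abstract, the following is a minimal NumPy sketch of their textbook definitions: the closed-form DDPM forward (noising) step and the Wasserstein critic and generator losses. It is not the authors' model. The abstract does not say how the two are wired together, so the linear beta schedule, the 100-step horizon, the 80-bin feature shape, and all function names here are illustrative assumptions.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    # Assumed noise schedule; the abstract does not specify one.
    return np.linspace(beta_start, beta_end, num_steps)

def forward_diffuse(x0, t, alpha_bar, rng):
    # Closed-form DDPM forward step:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

def wasserstein_critic_loss(critic_real, critic_fake):
    # The critic is trained to score real samples higher than generated ones,
    # so it minimizes mean(fake scores) - mean(real scores).
    return float(np.mean(critic_fake) - np.mean(critic_real))

def wasserstein_generator_loss(critic_fake):
    # The generator is trained to raise the critic's score on its samples.
    return float(-np.mean(critic_fake))

rng = np.random.default_rng(0)
betas = linear_beta_schedule(100)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product of (1 - beta_t)

# Hypothetical stand-in for 4 frames of an 80-bin acoustic feature.
x0 = rng.standard_normal((4, 80))
x_t, noise = forward_diffuse(x0, t=50, alpha_bar=alpha_bar, rng=rng)
```

In a real system the critic scores would come from a trained network, and a Lipschitz constraint on the critic (weight clipping or a gradient penalty) would be needed; both are omitted here for brevity.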
