WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis

Paper Title

WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis

Authors

Wang, Yi; Si, Yi

Abstract

Recently, GAN-based neural vocoders such as Parallel WaveGAN, MelGAN, HiFiGAN, and UnivNet have become popular due to their lightweight and parallel structure, resulting in a real-time synthesized waveform with high fidelity, even on a CPU. HiFiGAN and UnivNet are two SOTA vocoders. Despite their high quality, there is still room for improvement. In this paper, motivated by the structure of Vision Outlooker from computer vision, we adopt a similar idea and propose an effective and lightweight neural vocoder called WOLONet. In this network, we develop a novel lightweight block that uses a location-variable, channel-independent, and depthwise dynamic convolutional kernel with sinusoidally activated dynamic kernel weights. To demonstrate the effectiveness and generalizability of our method, we perform an ablation study to verify our novel design and make a subjective and objective comparison with typical GAN-based vocoders. The results show that our WOLONet achieves the best generation quality while requiring fewer parameters than the two neural SOTA vocoders, HiFiGAN and UnivNet.
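
The abstract names the properties of the proposed block but gives no formula, so the following is only a minimal illustrative sketch, not the authors' implementation. It assumes that "location-variable" means a separate kernel is predicted at every time step, that "channel-independent" means the predicted kernel is shared by all channels, and that "depthwise" means each channel is filtered on its own without mixing. The function name `dynamic_depthwise_conv`, the projection matrix `proj`, and the zero padding at the edges are all hypothetical choices made for this example.

```python
import math

def dynamic_depthwise_conv(x, proj, kernel_size=3):
    """Toy location-variable depthwise dynamic convolution (illustrative only).

    x:    list of T frames, each a list of C channel values.
    proj: C x kernel_size matrix (hypothetical) that maps one frame
          to kernel_size raw kernel weights.
    Returns a T x C list of output frames.
    """
    T, C = len(x), len(x[0])
    half = kernel_size // 2
    out = []
    for t in range(T):
        # Location-variable: the kernel is predicted from the frame at step t.
        raw = [sum(x[t][c] * proj[c][k] for c in range(C))
               for k in range(kernel_size)]
        # Sinusoidal activation applied to the dynamic kernel weights.
        w = [math.sin(r) for r in raw]
        frame = []
        for c in range(C):
            # Depthwise: channel c only sees its own neighbours in time.
            # Channel-independent (assumed): the same w is used for every c.
            acc = 0.0
            for k in range(kernel_size):
                s = t + k - half
                if 0 <= s < T:  # zero padding outside the signal
                    acc += w[k] * x[s][c]
            frame.append(acc)
        out.append(frame)
    return out

# Usage: 4 frames, 2 channels, a small hand-picked projection.
signal = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
projection = [[0.3, 1.0, -0.3], [-0.2, 0.8, 0.2]]
filtered = dynamic_depthwise_conv(signal, projection)
```

Because the kernel has only `kernel_size` weights per time step, rather than one set per channel, a block of this shape adds few parameters, which is consistent with the lightweight design the abstract describes.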
