Paper Title

DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation

Authors

Da-Yi Wu, Wen-Yi Hsiao, Fu-Rong Yang, Oscar Friedman, Warren Jackson, Scott Bruzenak, Yi-Wen Liu, Yi-Hsuan Yang

Abstract

A vocoder is a conditional audio generation model that converts acoustic features such as mel-spectrograms into waveforms. Taking inspiration from Differentiable Digital Signal Processing (DDSP), we propose a new vocoder named SawSing for singing voices. SawSing synthesizes the harmonic part of singing voices by filtering a sawtooth source signal with a linear time-variant finite impulse response filter whose coefficients are estimated from the input mel-spectrogram by a neural network. As this approach enforces phase continuity, SawSing can generate singing voices without the phase-discontinuity glitch of many existing vocoders. Moreover, the source-filter assumption provides an inductive bias that allows SawSing to be trained on a small amount of data. Our experiments show that SawSing converges much faster and outperforms state-of-the-art generative adversarial network and diffusion-based vocoders in a resource-limited scenario with only 3 training recordings and a 3-hour training time.
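
The subtractive idea in the abstract (a sawtooth source shaped by a time-varying FIR filter) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `sawtooth_from_f0` and `ltv_fir_filter`, the frame-wise overlap-add scheme, the naive (non-band-limited) sawtooth, and the random filter taps are all assumptions made for illustration. In SawSing the filter coefficients are predicted by a neural network from the mel-spectrogram; here they are placeholders.

```python
import numpy as np


def sawtooth_from_f0(f0_hz, sample_rate):
    """Naive (non-band-limited) sawtooth driven by a per-sample f0 curve."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    # Accumulating phase sample by sample keeps the waveform continuous
    # even when f0 changes over time.
    phase = np.cumsum(f0_hz / sample_rate)
    return 2.0 * (phase % 1.0) - 1.0


def ltv_fir_filter(source, frame_filters, hop):
    """Filter each hop-sized frame with its own FIR taps, then overlap-add."""
    source = np.asarray(source, dtype=float)
    frame_filters = np.asarray(frame_filters, dtype=float)
    n_frames, n_taps = frame_filters.shape
    out = np.zeros(n_frames * hop + n_taps - 1)
    for i in range(n_frames):
        frame = source[i * hop:(i + 1) * hop]
        if frame.size == 0:
            break  # source is shorter than n_frames * hop
        seg = np.convolve(frame, frame_filters[i])
        out[i * hop:i * hop + seg.size] += seg
    return out[:source.size]


# Illustrative values only (assumptions, not taken from the paper).
sample_rate, hop, n_frames, n_taps = 16000, 160, 10, 32
f0 = np.full(n_frames * hop, 220.0)
source = sawtooth_from_f0(f0, sample_rate)
# Hypothetical placeholder taps; a trained network would supply these.
filters = np.random.default_rng(0).normal(scale=0.1, size=(n_frames, n_taps))
audio = ltv_fir_filter(source, filters, hop)
```

Because the source phase is accumulated rather than reset per frame, the harmonic excitation stays continuous across frame boundaries, which is the property the abstract credits for avoiding phase-discontinuity glitches.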
