DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation

Paper title

DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation

Authors

Da-Yi Wu, Wen-Yi Hsiao, Fu-Rong Yang, Oscar Friedman, Warren Jackson, Scott Bruzenak, Yi-Wen Liu, Yi-Hsuan Yang

Abstract

A vocoder is a conditional audio generation model that converts acoustic features such as mel-spectrograms into waveforms. Taking inspiration from Differentiable Digital Signal Processing (DDSP), we propose a new vocoder named SawSing for singing voices. SawSing synthesizes the harmonic part of singing voices by filtering a sawtooth source signal with a linear time-variant finite impulse response filter whose coefficients are estimated from the input mel-spectrogram by a neural network. As this approach enforces phase continuity, SawSing can generate singing voices without the phase-discontinuity glitch of many existing vocoders. Moreover, the source-filter assumption provides an inductive bias that allows SawSing to be trained on a small amount of data. Our experiments show that SawSing converges much faster and outperforms state-of-the-art generative adversarial network and diffusion-based vocoders in a resource-limited scenario with only 3 training recordings and a 3-hour training time.
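
The harmonic branch described in the abstract (a sawtooth source shaped by a time-varying FIR filter) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the function names sawtooth_from_f0 and ltv_fir_filter are hypothetical, the sawtooth is a naive (not band-limited) one, the filter is applied frame by frame with overlap-add, and the filter taps are random placeholders standing in for the coefficients that a neural network would estimate from the mel-spectrogram.

```python
import numpy as np


def sawtooth_from_f0(f0, sr):
    """Naive sawtooth in [-1, 1) driven by a per-sample f0 contour in Hz."""
    # Accumulating the instantaneous frequency keeps the phase continuous
    # even when f0 changes from one sample to the next.
    phase = np.cumsum(f0 / sr)  # phase measured in cycles
    return 2.0 * (phase % 1.0) - 1.0


def ltv_fir_filter(source, frame_filters, hop):
    """Filter each hop-sized frame with its own FIR taps, then overlap-add.

    Assumes len(source) == frame_filters.shape[0] * hop.
    """
    n_frames, n_taps = frame_filters.shape
    out = np.zeros(n_frames * hop + n_taps - 1)
    for i in range(n_frames):
        frame = source[i * hop:(i + 1) * hop]
        seg = np.convolve(frame, frame_filters[i])  # length hop + n_taps - 1
        out[i * hop:i * hop + len(seg)] += seg      # tails overlap next frames
    return out[:len(source)]


# Illustrative settings (assumptions, not values taken from the paper).
sr, hop, n_frames, n_taps = 16000, 160, 100, 64
f0 = np.linspace(220.0, 440.0, n_frames * hop)  # a one-second pitch glide
saw = sawtooth_from_f0(f0, sr)

# Placeholder taps: random, decaying impulse responses, one per frame.
rng = np.random.default_rng(0)
decay = np.exp(-np.arange(n_taps) / 8.0)
filters = rng.standard_normal((n_frames, n_taps)) * decay

harmonic = ltv_fir_filter(saw, filters, hop)
```

Because the source phase is accumulated rather than reset per frame, the waveform stays continuous across frame boundaries, which is the property the abstract credits for avoiding phase-discontinuity glitches. A practical system would likely also need a band-limited source to control aliasing.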
