Paper Title
Sound Model Factory: An Integrated System Architecture for Generative Audio Modelling
Paper Authors
Paper Abstract
We introduce a new system for data-driven audio sound model design built around two different neural network architectures, a Generative Adversarial Network (GAN) and a Recurrent Neural Network (RNN), that takes advantage of the unique characteristics of each to achieve system objectives that neither is capable of addressing alone. The objective of the system is to generate interactively controllable sound models given (a) a range of sounds the model should be able to synthesize, and (b) a specification of the parametric controls for navigating that space of sounds. The range of sounds is defined by a dataset provided by the designer, while the means of navigation is defined by a combination of data labels and the selection of a sub-manifold from the latent space learned by the GAN. Our proposed system takes advantage of the rich latent space of a GAN, which consists of sounds that fill out the spaces "between" real data-like sounds. This augmented data from the GAN is then used to train an RNN for its ability to respond immediately and continuously to parameter changes and to generate audio over unlimited periods of time. Furthermore, we develop a self-organizing map technique for "smoothing" the latent space of the GAN that results in perceptually smooth interpolation between audio timbres. We validate this process through user studies. The system contributes advances to the state of the art for generative sound model design, including system configuration and components for improving interpolation, and the expansion of audio modeling capabilities beyond musical pitch and percussive instrument sounds into the more complex space of audio textures.
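The abstract mentions a self-organizing map (SOM) technique for "smoothing" the GAN's latent space so that traversing it yields perceptually smooth timbre interpolation. The paper's exact method is not reproduced here; the sketch below is a generic 2-D SOM fit to a set of latent vectors, illustrating the underlying idea: neighbouring grid nodes are pulled toward similar codebook vectors, so a path across the grid changes smoothly. The function name `train_som` and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def train_som(data, grid_h=8, grid_w=8, epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Fit a 2-D self-organizing map to latent vectors (illustrative sketch).

    Each grid node holds a codebook vector. For every sample, the
    best-matching node and its grid neighbours are pulled toward the
    sample, so adjacent nodes converge to similar vectors -- a
    "smoothed" surface over the latent space.
    """
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.standard_normal((grid_h, grid_w, dim)) * 0.1
    # Grid coordinates, used to compute on-grid neighbourhood distances.
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    coords = np.stack([ys, xs], axis=-1).astype(float)

    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            t = step / n_steps
            lr = lr0 * (1.0 - t)               # decaying learning rate
            sigma = sigma0 * (1.0 - t) + 0.5   # shrinking neighbourhood radius
            # Best-matching unit: node whose codebook vector is closest to x.
            d = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighbourhood on the grid, centred at the BMU.
            g = np.exp(-np.sum((coords - np.array(bmu)) ** 2, axis=-1)
                       / (2.0 * sigma ** 2))
            weights += lr * g[..., None] * (x - weights)
            step += 1
    return weights
```

After training, adjacent grid nodes hold closer codebook vectors than arbitrary node pairs, which is the property that makes interpolation along the grid perceptually gradual.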