Paper Title

Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation

Paper Authors

Ye Zhu, Yu Wu, Kyle Olszewski, Jian Ren, Sergey Tulyakov, Yan Yan

Paper Abstract

Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, due to their promising results and support for cross-modal synthesis. A key desideratum in conditional synthesis is to achieve high correspondence between the conditioning input and the generated output. Most existing methods learn such relationships implicitly, by incorporating the prior into the variational lower bound. In this work, we take a different route -- we explicitly enhance input-output connections by maximizing their mutual information. To this end, we introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms to effectively incorporate it into the denoising process, combining diffusion training and contrastive learning for the first time by connecting them with the conventional variational objectives. We demonstrate the efficacy of our approach in evaluations on diverse multimodal conditional synthesis tasks: dance-to-music generation, text-to-image synthesis, and class-conditioned image synthesis. On each, we enhance the input-output correspondence and achieve higher or competitive general synthesis quality. Furthermore, the proposed approach improves the convergence of diffusion models, reducing the number of required diffusion steps by more than 35% on two benchmarks and significantly increasing inference speed.
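
The core idea, augmenting the denoising objective with a contrastive term so that the generated output is maximally informative about its conditioning input, can be illustrated in a few lines. Below is a minimal PyTorch sketch, not the authors' CDCD implementation (which operates on discrete diffusion over VQ token spaces): `denoise_model`, `out_encoder`, and `cond_encoder` are hypothetical placeholder modules, and a continuous MSE denoising loss stands in for the discrete variational objective.

```python
# Minimal sketch (hypothetical, not the paper's code): an InfoNCE-style
# contrastive term added to a standard denoising loss, approximating the
# mutual information between the conditioning input and the output.
import torch
import torch.nn.functional as F

def contrastive_diffusion_loss(denoise_model, out_encoder, cond_encoder,
                               x_t, t, x_0, conds,
                               temperature=0.1, lam=1.0):
    """conds[0] is the condition paired with x_0; conds[1:] are negatives."""
    # Standard denoising objective: reconstruct the clean sample from x_t.
    x0_pred = denoise_model(x_t, t, conds[0])
    denoise_loss = F.mse_loss(x0_pred, x_0)

    # InfoNCE-style term: the generated output should score highest against
    # its own condition embedding; this lower-bounds mutual information.
    z = out_encoder(x0_pred)                              # shape (d,)
    c = torch.stack([cond_encoder(ci) for ci in conds])   # shape (K, d)
    logits = F.cosine_similarity(z.unsqueeze(0), c, dim=-1) / temperature
    target = torch.zeros(1, dtype=torch.long)             # positive at index 0
    contrastive = F.cross_entropy(logits.unsqueeze(0), target)
    return denoise_loss + lam * contrastive
```

In practice the negatives `conds[1:]` would typically be drawn from other samples in the batch; larger negative sets generally tighten the mutual-information bound that the InfoNCE loss estimates.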
