Sound-Guided Semantic Video Generation

Paper Title

Sound-Guided Semantic Video Generation

Authors

Lee, Seung Hyun; Oh, Gyeongrok; Byeon, Wonmin; Kim, Chanyoung; Ryoo, Won Jeong; Yoon, Sang Ho; Cho, Hyunjun; Bae, Jihyun; Kim, Jinkyu; Kim, Sangpil

Abstract

The recent success of StyleGAN demonstrates that the latent space of a pre-trained StyleGAN is useful for realistic video generation. However, the generated motion in the video is usually not semantically meaningful, because it is difficult to determine the direction and magnitude in the StyleGAN latent space. In this paper, we propose a framework to generate realistic videos by leveraging a multimodal (sound-image-text) embedding space. As sound provides the temporal context of the scene, our framework learns to generate a video that is semantically consistent with the sound. First, our sound inversion module maps the audio directly into the StyleGAN latent space. We then incorporate the CLIP-based multimodal embedding space to further provide the audio-visual relationships. Finally, the proposed frame generator learns to find a trajectory in the latent space that is coherent with the corresponding sound, and generates a video in a hierarchical manner. We provide a new high-resolution landscape video dataset (audio-visual pairs) for the sound-guided video generation task. The experiments show that our model outperforms state-of-the-art methods in terms of video quality. We further show several applications, including image and video editing, to verify the effectiveness of our method.
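
The abstract describes mapping audio into a latent space and then finding a sound-coherent trajectory through it. The following is a toy sketch of that idea only, not the authors' implementation: the linear map `W`, the helper names (`invert_sound`, `latent_trajectory`), the `step` parameter, and the interpolation rule are all assumptions made for illustration. No StyleGAN or CLIP model is used; cosine similarity merely stands in for a score in a CLIP-style joint embedding space.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def invert_sound(audio_feat, W):
    """Hypothetical 'sound inversion': a linear map from audio feature to latent code."""
    return matvec(W, audio_feat)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def latent_trajectory(w0, audio_frames, W, step=0.5):
    """Move the latent code toward the sound-derived target, one step per audio frame.

    Assumed update rule: w_t = w_{t-1} + step * (target_t - w_{t-1}).
    Returns the list [w_0, w_1, ..., w_T].
    """
    trajectory = [list(w0)]
    w = list(w0)
    for feat in audio_frames:
        target = invert_sound(feat, W)
        w = [wi + step * (ti - wi) for wi, ti in zip(w, target)]
        trajectory.append(w)
    return trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    audio_dim, latent_dim, n_frames = 4, 8, 5
    # Random stand-ins for a learned mapping and for per-frame audio features.
    W = [[rng.gauss(0, 1) for _ in range(audio_dim)] for _ in range(latent_dim)]
    audio = [[rng.gauss(0, 1) for _ in range(audio_dim)] for _ in range(n_frames)]
    traj = latent_trajectory([0.0] * latent_dim, audio, W)
    for t, w in enumerate(traj[1:], start=1):
        # Similarity between each latent and the target derived from its audio frame.
        print(t, round(cosine(w, invert_sound(audio[t - 1], W)), 3))
```

In a real system each latent code along the trajectory would be decoded into a frame by a pre-trained generator, and the mapping would be learned rather than random; both are outside the scope of this sketch.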
