Paper Title
Make-A-Video: Text-to-Video Generation without Text-Video Data
Paper Authors
Paper Abstract
We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today's image generation models. We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, we design a spatial temporal pipeline to generate high resolution and frame rate videos with a video decoder, interpolation model and two super resolution models that can enable various applications besides T2V. In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.
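The space-time factorization mentioned in the abstract (decomposing the full temporal U-Net and attention tensors and approximating them in space and time) can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' implementation: a "pseudo-3D" convolution block that applies a 2D spatial convolution per frame followed by a 1D temporal convolution per spatial location. The class name Pseudo3DConv and the identity initialization of the temporal convolution (so training can start from a pretrained T2I model's behavior) are assumptions made for this sketch.

# Illustrative sketch only: factorized space-time convolution, assuming PyTorch.
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    """Approximate a full 3D conv with a 2D spatial conv followed by a 1D temporal conv."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Spatial convolution over (H, W); in principle this could reuse weights
        # from a pretrained text-to-image model (an assumption for this sketch).
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        # Temporal convolution over T, initialized to the identity so the block
        # initially behaves like the per-frame image model.
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Apply the spatial conv to every frame independently.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)
        # Apply the temporal conv to every spatial location independently.
        x = x.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        x = self.temporal(x)
        # Restore the (batch, channels, frames, height, width) layout.
        x = x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
        return x


if __name__ == "__main__":
    video = torch.randn(2, 64, 8, 32, 32)  # (batch, channels, frames, H, W)
    block = Pseudo3DConv(64)
    print(block(video).shape)  # torch.Size([2, 64, 8, 32, 32])

The same factorization idea carries over to attention in the abstract's description: spatial attention within each frame followed by temporal attention across frames at each spatial position, which is far cheaper than full spatiotemporal attention.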