Paper Title

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Paper Authors

Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, Ming Zhou

Paper Abstract

With the recent success of pre-training techniques for NLP and image-linguistic tasks, video-linguistic pre-training works have gradually been developed to improve video-text-related downstream tasks. However, most existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy on generation tasks. This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on a Transformer backbone. Five objectives, namely video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction, are designed to train each of the components. We further develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training process of UniVL more effective. Pre-training is carried out on the large-scale instructional video dataset HowTo100M. Experimental results demonstrate that UniVL learns strong video-text representations and achieves state-of-the-art results on five downstream tasks.
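
To make the four-component layout concrete, below is a minimal PyTorch sketch of the architecture the abstract names: two single-modal encoders (text and video), a cross encoder over the fused sequence, and a Transformer decoder for generation. The class name `UniVLSketch`, all dimensions and layer counts, the concatenation-based fusion, and the single language-modeling head are illustrative assumptions, not the paper's actual configuration (the real model uses BERT-style text encoding and pre-extracted video features, and is trained with all five objectives rather than the one decoding path shown here).

```python
import torch
import torch.nn as nn


class UniVLSketch(nn.Module):
    """Hypothetical sketch of the four-component UniVL layout:
    text encoder, video encoder, cross encoder, and decoder."""

    def __init__(self, vocab_size=30522, d_model=768, video_feat_dim=1024,
                 n_heads=12, n_layers=2):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Single-modal text encoder (BERT-like in the paper).
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Single-modal video encoder over pre-extracted clip features.
        self.video_proj = nn.Linear(video_feat_dim, d_model)
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Cross encoder fuses both modalities; plain concatenation is an
        # assumption of this sketch.
        self.cross_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Decoder attends over the fused sequence for generation objectives
        # such as language reconstruction.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, video_feats, target_ids):
        t = self.text_encoder(self.text_embed(token_ids))       # (B, Lt, D)
        v = self.video_encoder(self.video_proj(video_feats))    # (B, Lv, D)
        fused = self.cross_encoder(torch.cat([t, v], dim=1))    # (B, Lt+Lv, D)
        out = self.decoder(self.text_embed(target_ids), fused)  # (B, Lo, D)
        return self.lm_head(out)                                # token logits


# Hypothetical shapes: batch of 2, 8 caption tokens, 16 video clips, 8 targets.
model = UniVLSketch()
logits = model(torch.randint(0, 30522, (2, 8)),
               torch.randn(2, 16, 1024),
               torch.randint(0, 30522, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 30522])
```

Keeping the two single-modal encoders separate from the cross encoder is what allows the staged strategy the abstract mentions: the single-modal parts can be pre-trained first and the fusion and decoding parts brought in later.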
