Paper Title

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Paper Authors

Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, Ming Zhou

Paper Abstract

With the recent success of pre-training techniques for NLP and image-linguistic tasks, video-linguistic pre-training works have gradually been developed to improve video-text-related downstream tasks. However, most existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy on generation tasks. This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on a Transformer backbone. Five objectives, namely video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction, are designed to train each of the components. We further develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training process of UniVL more effective. Pre-training is carried out on the large-scale instructional video dataset HowTo100M. Experimental results demonstrate that UniVL learns strong video-text representations and achieves state-of-the-art results on five downstream tasks.
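
To make the four-component layout concrete, below is a minimal PyTorch sketch of the architecture the abstract names: two single-modal encoders (text and video), a cross encoder over the fused sequence, and a Transformer decoder for generation. The class name `UniVLSketch`, all dimensions and layer counts, the concatenation-based fusion, and the single language-modeling head are illustrative assumptions, not the paper's actual configuration (the real model uses BERT-style text encoding and pre-extracted video features, and is trained with all five objectives rather than the one decoding path shown here).

```python
import torch
import torch.nn as nn


class UniVLSketch(nn.Module):
    """Hypothetical sketch of the four-component UniVL layout:
    text encoder, video encoder, cross encoder, and decoder."""

    def __init__(self, vocab_size=30522, d_model=768, video_feat_dim=1024,
                 n_heads=12, n_layers=2):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Single-modal text encoder (BERT-like in the paper).
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Single-modal video encoder over pre-extracted clip features.
        self.video_proj = nn.Linear(video_feat_dim, d_model)
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Cross encoder fuses both modalities; plain concatenation is an
        # assumption of this sketch.
        self.cross_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Decoder attends over the fused sequence for generation objectives
        # such as language reconstruction.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, video_feats, target_ids):
        t = self.text_encoder(self.text_embed(token_ids))       # (B, Lt, D)
        v = self.video_encoder(self.video_proj(video_feats))    # (B, Lv, D)
        fused = self.cross_encoder(torch.cat([t, v], dim=1))    # (B, Lt+Lv, D)
        out = self.decoder(self.text_embed(target_ids), fused)  # (B, Lo, D)
        return self.lm_head(out)                                # token logits


# Hypothetical shapes: batch of 2, 8 caption tokens, 16 video clips, 8 targets.
model = UniVLSketch()
logits = model(torch.randint(0, 30522, (2, 8)),
               torch.randn(2, 16, 1024),
               torch.randint(0, 30522, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 30522])
```

Keeping the two single-modal encoders separate from the cross encoder is what allows the staged strategy the abstract mentions: the single-modal parts can be pre-trained first and the fusion and decoding parts brought in later.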
