Paper Title

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

Paper Authors

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei

Paper Abstract

A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
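
To make the Multiway Transformer idea concrete, below is a minimal PyTorch sketch of one such block: the self-attention module is shared across modalities (enabling deep fusion of image and text tokens), while each modality is routed through its own feed-forward "expert" (modality-specific encoding). The class name, default dimensions, and the `modality` routing interface are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class MultiwayBlock(nn.Module):
    """Sketch of one Multiway Transformer block: shared self-attention plus
    per-modality feed-forward experts. Names and default sizes are
    assumptions, not the paper's exact configuration."""

    def __init__(self, dim: int = 768, num_heads: int = 12, ffn_dim: int = 3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Shared across modalities: lets image and text tokens attend to
        # each other within the same layer (deep fusion).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN "expert" per input type (modality-specific encoding).
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, dim),
            )
            for name in ("vision", "language", "vision_language")
        })

    def forward(self, x: torch.Tensor, modality: str = "vision") -> torch.Tensor:
        # x: (batch, seq_len, dim); `modality` selects the FFN expert.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.experts[modality](self.norm2(x))


# Usage: route image patch tokens through the vision expert; text subword
# tokens would instead use modality="language".
block = MultiwayBlock()
patches = torch.randn(2, 197, 768)  # e.g., ViT patch embeddings + [CLS]
out = block(patches, modality="vision")
print(out.shape)  # torch.Size([2, 197, 768])
```

During pretraining, image tokens, text tokens, or concatenated image-text pairs would pass through a stack of such blocks with a fraction of tokens masked, and the model is trained to recover the originals; this is the unified masked "language" modeling objective the abstract describes.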
