Paper Title

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

Paper Authors

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei

Paper Abstract

A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
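
To make the Multiway Transformer idea concrete, below is a minimal PyTorch sketch of one such block: the self-attention module is shared across modalities (enabling deep fusion of image and text tokens), while each modality is routed through its own feed-forward "expert" (modality-specific encoding). The class name, default dimensions, and the `modality` routing interface are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class MultiwayBlock(nn.Module):
    """Sketch of one Multiway Transformer block: shared self-attention plus
    per-modality feed-forward experts. Names and default sizes are
    assumptions, not the paper's exact configuration."""

    def __init__(self, dim: int = 768, num_heads: int = 12, ffn_dim: int = 3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Shared across modalities: lets image and text tokens attend to
        # each other within the same layer (deep fusion).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN "expert" per input type (modality-specific encoding).
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, dim),
            )
            for name in ("vision", "language", "vision_language")
        })

    def forward(self, x: torch.Tensor, modality: str = "vision") -> torch.Tensor:
        # x: (batch, seq_len, dim); `modality` selects the FFN expert.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.experts[modality](self.norm2(x))


# Usage: route image patch tokens through the vision expert; text subword
# tokens would instead use modality="language".
block = MultiwayBlock()
patches = torch.randn(2, 197, 768)  # e.g., ViT patch embeddings + [CLS]
out = block(patches, modality="vision")
print(out.shape)  # torch.Size([2, 197, 768])
```

During pretraining, image tokens, text tokens, or concatenated image-text pairs would pass through a stack of such blocks with a fraction of tokens masked, and the model is trained to recover the originals; this is the unified masked "language" modeling objective the abstract describes.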
