Paper Title
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Paper Authors
Paper Abstract
Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is available at https://github.com/microsoft/FIBER.
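To make the "fusion in the backbone" idea concrete, below is a minimal PyTorch sketch of a transformer block in which a cross-attention sub-layer is inserted between self-attention and the feed-forward network, so tokens from one modality can attend to the other without a separate fusion stack on top of the backbones. The class name, argument names, layer sizes, and the zero-initialized gate are illustrative assumptions for this sketch, not FIBER's actual implementation; see the linked repository for the real code.

```python
# Minimal sketch of "fusion in the backbone": a cross-attention sub-layer is
# inserted inside an otherwise standard transformer block, so image (or text)
# tokens can attend to the other modality directly. Names, sizes, and the
# zero-initialized gate are illustrative, not FIBER's actual code.
from typing import Optional

import torch
import torch.nn as nn


class FusionTransformerBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Gate starts at zero so the block initially behaves like the original
        # uni-modal block and cross-modal fusion is learned gradually (an
        # assumption of this sketch, not necessarily FIBER's scheme).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(
        self, x: torch.Tensor, other: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        # Standard self-attention over the block's own modality.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention to the other modality (text tokens for an image
        # block, image tokens for a text block); skipped for uni-modal use.
        if other is not None:
            h = self.norm2(x)
            x = x + self.gate * self.cross_attn(h, other, other, need_weights=False)[0]
        # Feed-forward sub-layer.
        return x + self.mlp(self.norm3(x))


# Toy usage: 196 image patch tokens attending to 32 text tokens.
img_block = FusionTransformerBlock()
image_tokens = torch.randn(2, 196, 768)
text_tokens = torch.randn(2, 32, 768)
fused = img_block(image_tokens, other=text_tokens)
print(fused.shape)  # torch.Size([2, 196, 768])
```

Because the cross-attention is an optional sub-layer inside each backbone block rather than a dedicated fusion transformer on top, the same backbones can be reused for both coarse-grained image-level tasks and fine-grained region-level tasks, which is the property the abstract highlights.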