Paper Title
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Paper Authors
Paper Abstract
Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is available at https://github.com/microsoft/FIBER.
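To make the "fusion in the backbone" idea concrete, below is a minimal PyTorch sketch of a transformer block in which a cross-attention sub-layer is inserted between self-attention and the feed-forward network, so tokens from one modality can attend to the other without a separate fusion stack on top of the backbones. The class name, argument names, layer sizes, and the zero-initialized gate are illustrative assumptions for this sketch, not FIBER's actual implementation; see the linked repository for the real code.

```python
# Minimal sketch of "fusion in the backbone": a cross-attention sub-layer is
# inserted inside an otherwise standard transformer block, so image (or text)
# tokens can attend to the other modality directly. Names, sizes, and the
# zero-initialized gate are illustrative, not FIBER's actual code.
from typing import Optional

import torch
import torch.nn as nn


class FusionTransformerBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Gate starts at zero so the block initially behaves like the original
        # uni-modal block and cross-modal fusion is learned gradually (an
        # assumption of this sketch, not necessarily FIBER's scheme).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(
        self, x: torch.Tensor, other: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        # Standard self-attention over the block's own modality.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention to the other modality (text tokens for an image
        # block, image tokens for a text block); skipped for uni-modal use.
        if other is not None:
            h = self.norm2(x)
            x = x + self.gate * self.cross_attn(h, other, other, need_weights=False)[0]
        # Feed-forward sub-layer.
        return x + self.mlp(self.norm3(x))


# Toy usage: 196 image patch tokens attending to 32 text tokens.
img_block = FusionTransformerBlock()
image_tokens = torch.randn(2, 196, 768)
text_tokens = torch.randn(2, 32, 768)
fused = img_block(image_tokens, other=text_tokens)
print(fused.shape)  # torch.Size([2, 196, 768])
```

Because the cross-attention is an optional sub-layer inside each backbone block rather than a dedicated fusion transformer on top, the same backbones can be reused for both coarse-grained image-level tasks and fine-grained region-level tasks, which is the property the abstract highlights.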