Paper Title
CoCa: Contrastive Captioners are Image-Text Foundation Models
Paper Authors
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
Paper Abstract
Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.
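To make the architecture and the joint objective concrete, below is a minimal PyTorch sketch of the structure the abstract describes: an image encoder, a decoder whose lower half uses causal self-attention only (unimodal text), and an upper half that cross-attends to image tokens (multimodal), trained with a contrastive loss plus an autoregressive captioning loss. This is an illustrative reconstruction from the abstract alone, not the paper's implementation; all layer counts, dimensions, the mean-pooling of embeddings, and the CLIP-style learnable temperature are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoCaSketch(nn.Module):
    def __init__(self, vocab_size=1000, dim=256, heads=4, max_len=64,
                 n_unimodal=3, n_multimodal=3, patch=16):
        super().__init__()
        # Image encoder: a small ViT-style stub (patchify + self-attention).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                               batch_first=True)
        self.image_encoder = nn.TransformerEncoder(enc_layer, num_layers=3)

        self.tok_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, dim))
        # Lower half of the decoder: causal self-attention only, no
        # cross-attention, yielding a purely unimodal text representation.
        self.unimodal = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            for _ in range(n_unimodal))
        # Upper half: standard decoder layers that cross-attend to the image.
        self.multimodal = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, heads, 4 * dim, batch_first=True)
            for _ in range(n_multimodal))
        self.to_logits = nn.Linear(dim, vocab_size)
        # Learnable contrastive temperature (a CLIP-style assumption).
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def forward(self, images, tokens):
        B, L = tokens.shape
        img = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
        img = self.image_encoder(img)

        causal = torch.full((L, L), float("-inf"),
                            device=tokens.device).triu(1)
        txt = self.tok_embed(tokens) + self.pos_embed[:, :L]
        for layer in self.unimodal:
            txt = layer(txt, src_mask=causal)

        # Contrastive loss between unimodal image and text embeddings.
        # Mean pooling is a simplification; the abstract does not specify
        # how the single per-modality embedding is produced.
        img_emb = F.normalize(img.mean(dim=1), dim=-1)
        txt_emb = F.normalize(txt.mean(dim=1), dim=-1)
        sims = img_emb @ txt_emb.t() * self.logit_scale.exp()
        targets = torch.arange(B, device=tokens.device)
        contrastive = (F.cross_entropy(sims, targets) +
                       F.cross_entropy(sims.t(), targets)) / 2

        # Captioning loss: the multimodal half predicts the next token
        # autoregressively, conditioned on the image via cross-attention.
        mm = txt
        for layer in self.multimodal:
            mm = layer(mm, img, tgt_mask=causal)
        logits = self.to_logits(mm)
        captioning = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            tokens[:, 1:].reshape(-1))

        # Both objectives reuse one forward pass through the shared graph.
        return contrastive + captioning


# Example usage with dummy data.
model = CoCaSketch()
images = torch.randn(4, 3, 64, 64)
tokens = torch.randint(0, 1000, (4, 16))
loss = model(images, tokens)
loss.backward()
```

Note how the unimodal text activations are computed once and then reused as input to the multimodal layers, which is what lets the contrastive and captioning objectives share the same computational graph with minimal overhead.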