Paper title
Prefix Conditioning Unifies Language and Label Supervision
Authors
Abstract
Image-classification datasets have long been used to pretrain image recognition models. Recently, web-scale image-caption datasets have emerged as a powerful alternative source of pretraining data. Image-caption datasets are more "open-domain", containing a wider variety of scene types and vocabulary words than traditional classification datasets, and models trained on them have demonstrated strong performance on few- and zero-shot recognition tasks. We show that naively unifying image-classification and image-caption datasets lets dataset bias negatively affect pretraining: the unification can tailor the model to the classification dataset, which reduces the generalizability of the learned representations, makes the model vulnerable to distribution shift away from that dataset, and thus jeopardizes zero-shot performance. In this work, we address the problem by disentangling the dataset bias using prefix tokens that inform the language encoder of the type of the input dataset (e.g., image-classification or image-caption) at training time. This approach allows the language encoder to share knowledge across the two datasets while switching its mode of feature extraction between a mode tailored to the image-classification dataset and a mode tailored to the image-caption dataset; we use the image-caption mode in zero-shot evaluation. Our method is generic and can be easily integrated into existing vision-language (VL) pretraining objectives such as CLIP or UniCL. In experiments, we show that this simple technique improves zero-shot image recognition accuracy and robustness to image-level distribution shift.
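The prefix-conditioning idea described in the abstract can be illustrated with a minimal sketch. It covers only the text side: each training text is tagged with a prefix that identifies its source dataset, and zero-shot prompts always use the caption-mode prefix. The prefix strings `[CLS_DATA]` and `[CAP_DATA]`, the function names, and the prompt template are all hypothetical, chosen for illustration; the abstract does not specify them, and in an actual model the prefix would presumably correspond to token embeddings consumed by the language encoder rather than to literal strings.

```python
# Hypothetical prefix tokens (assumption: the abstract does not name them).
PREFIX_TOKENS = {
    "classification": "[CLS_DATA]",
    "caption": "[CAP_DATA]",
}


def build_text_input(text: str, dataset_type: str) -> str:
    """Prepend the prefix that tells the language encoder the data source."""
    if dataset_type not in PREFIX_TOKENS:
        raise ValueError(f"unknown dataset type: {dataset_type!r}")
    return f"{PREFIX_TOKENS[dataset_type]} {text}"


def training_texts(batch):
    """Tag each (text, dataset_type) pair from a mixed training batch."""
    return [build_text_input(text, dataset_type) for text, dataset_type in batch]


def zero_shot_prompts(class_names, template="a photo of a {}."):
    """Build class prompts for zero-shot evaluation.

    The caption-mode prefix is always used here, as the abstract states.
    The default template is an assumed CLIP-style prompt, not from the source.
    """
    return [
        build_text_input(template.format(name), "caption")
        for name in class_names
    ]
```

For example, `training_texts([("golden retriever", "classification"), ("a dog on a beach", "caption")])` tags the class name with the classification prefix and the caption with the caption prefix, while `zero_shot_prompts(["cat"])` produces a caption-mode prompt even though "cat" is a class label.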