Paper Title
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection
Paper Authors
Paper Abstract
Open-world object detection, as a more general and challenging goal, aims to recognize and localize objects described by arbitrary category names. The recent work GLIP formulates this problem as a grounding problem by concatenating all category names of detection datasets into sentences, which leads to inefficient interaction between category names. This paper presents DetCLIP, a paralleled visual-concept pre-training method for open-world detection that resorts to knowledge enrichment from a designed concept dictionary. To achieve better learning efficiency, we propose a novel paralleled concept formulation that extracts concepts separately to better utilize heterogeneous datasets (i.e., detection, grounding, and image-text pairs) for training. We further design a concept dictionary (with descriptions) from various online sources and detection datasets to provide prior knowledge for each concept. By enriching the concepts with their descriptions, we explicitly build relationships among various concepts to facilitate open-domain learning. The proposed concept dictionary is further used to provide sufficient negative concepts for constructing the word-region alignment loss, and to complete labels for objects whose descriptions are missing from the captions of image-text pair data. The proposed framework demonstrates strong zero-shot detection performance; e.g., on the LVIS dataset, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories compared to the fully-supervised model with the same backbone as ours.
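The abstract describes a word-region alignment loss in which negative concepts sampled from the concept dictionary enlarge the candidate set each region must be discriminated against. The sketch below is a minimal, hypothetical illustration of that idea (not the paper's implementation): it assumes pre-computed region and concept-text embeddings, and all names such as `word_region_alignment_loss`, `pos_concept_feats`, and `neg_concept_feats` are placeholders introduced here for clarity.

```python
# Minimal sketch of a contrastive word-region alignment loss with
# dictionary-sampled negative concepts. Assumes embeddings are already
# produced by image and text encoders; shapes and names are illustrative.
import torch
import torch.nn.functional as F


def word_region_alignment_loss(region_feats, pos_concept_feats,
                               neg_concept_feats, pos_labels,
                               temperature=0.07):
    """Align each region with its matched concept against all candidates.

    region_feats:      (R, D) embeddings of R proposed regions
    pos_concept_feats: (P, D) embeddings of the P concepts present in the image
    neg_concept_feats: (N, D) embeddings of N negatives drawn from the dictionary
    pos_labels:        (R,)   index in [0, P) of each region's matched concept
    """
    # L2-normalize so dot products become cosine similarities.
    region_feats = F.normalize(region_feats, dim=-1)
    concepts = F.normalize(
        torch.cat([pos_concept_feats, neg_concept_feats]), dim=-1)

    # (R, P + N) similarity logits; dictionary negatives widen the
    # concept vocabulary each region is classified over.
    logits = region_feats @ concepts.t() / temperature

    # Cross-entropy pushes each region toward its positive concept and
    # away from all other (including dictionary-sampled) concepts.
    return F.cross_entropy(logits, pos_labels)


if __name__ == "__main__":
    torch.manual_seed(0)
    regions = torch.randn(6, 256)      # 6 region embeddings
    positives = torch.randn(3, 256)    # 3 concepts annotated in the image
    negatives = torch.randn(20, 256)   # 20 concepts sampled from the dictionary
    labels = torch.tensor([0, 0, 1, 2, 2, 1])
    print(word_region_alignment_loss(regions, positives, negatives, labels))
```

The design choice this sketch highlights is that, unlike concatenating category names into a single grounding sentence, each concept is embedded independently, so negatives from an external dictionary can be appended freely without altering how the positive concepts are encoded.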