Paper Title
Diverse Instance Discovery: Vision-Transformer for Instance-Aware Multi-Label Image Recognition
Paper Authors
Paper Abstract
Previous works on multi-label image recognition (MLIR) usually use CNNs as a starting point for research. In this paper, we take a pure Vision Transformer (ViT) as the research base and fully exploit the Transformer's advantage in long-range dependency modeling to circumvent the limitation of CNNs to local receptive fields. However, for multi-label images containing multiple objects of different categories, scales, and spatial relations, using global information alone is not optimal. Our goal is to leverage ViT's patch tokens and self-attention mechanism to mine rich instances in multi-label images, a process we call Diverse Instance Discovery (DiD). To this end, we propose a semantic category-aware module and a spatial relationship-aware module, and then combine the two through a re-constraint strategy to obtain instance-aware attention maps. Finally, we propose a weakly supervised object localization-based approach to extract multi-scale local features and form a multi-view pipeline. Our method requires only weakly supervised information at the label level, with no additional knowledge injection or other strongly supervised information. Experiments on three benchmark datasets show that our method significantly outperforms previous works and achieves state-of-the-art results under fair experimental comparisons.
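To make the described pipeline more concrete, below is a minimal PyTorch sketch of how instance-aware attention maps and weakly supervised cropping could be wired together on top of a ViT backbone that exposes patch tokens. The module and function names (InstanceAwareHead, localize_and_crop), the tensor shapes, and the simple mean-thresholding heuristic are illustrative assumptions for this sketch, not the authors' released implementation of DiD.

```python
# Hypothetical sketch of a DiD-style pipeline (assumptions, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class InstanceAwareHead(nn.Module):
    """Combine a semantic category-aware map with a spatial-relationship map
    (both simplified assumptions here) into instance-aware attention maps."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.class_proj = nn.Linear(dim, num_classes)  # per-patch class evidence
        self.spatial_proj = nn.Linear(dim, dim)        # tokens for pairwise affinity

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) from the ViT encoder (CLS token excluded)
        sem = self.class_proj(patch_tokens)            # (B, N, C) semantic map
        q = self.spatial_proj(patch_tokens)            # (B, N, D)
        affinity = torch.softmax(
            q @ q.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1
        )                                              # (B, N, N) spatial relations
        # "Re-constraint": refine the semantic maps with the spatial affinity.
        return affinity @ sem                          # (B, N, C) instance-aware maps


def localize_and_crop(image: torch.Tensor, inst_map: torch.Tensor,
                      grid: int, out_size: int = 224) -> torch.Tensor:
    """Weakly supervised localization: threshold the strongest class map,
    take the bounding box of high-response patches, and crop a local view."""
    # image: (C, H, W) for one sample; inst_map: (N, num_classes), N = grid * grid
    cls = inst_map.mean(0).argmax()                    # most confident class
    heat = inst_map[:, cls].reshape(grid, grid)
    mask = heat > heat.mean()                          # simple threshold (assumption)
    ys, xs = mask.nonzero(as_tuple=True)
    if len(ys) == 0:
        return F.interpolate(image[None], size=out_size)[0]
    # Convert patch indices to pixel coordinates.
    ph, pw = image.shape[-2] // grid, image.shape[-1] // grid
    y0, y1 = ys.min() * ph, (ys.max() + 1) * ph
    x0, x1 = xs.min() * pw, (xs.max() + 1) * pw
    crop = image[:, y0:y1, x0:x1]
    return F.interpolate(crop[None], size=out_size)[0]  # multi-scale local view
```

In a full multi-view pipeline, the cropped local views would be passed through the ViT again and their predictions fused with the global view under label-level supervision only; the abstract does not specify the exact modules or losses, so those details are omitted here.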