Paper Title

On Guiding Visual Attention with Language Specification

Paper Authors

Suzanne Petryk, Lisa Dunlap, Keyan Nasseri, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach

Paper Abstract

While real-world challenges typically define visual categories with language words or phrases, most visual classification methods define categories with numerical indices. However, the language specification of the classes provides an especially useful prior for biased and noisy datasets, where it can help disambiguate what features are task-relevant. Recently, large-scale multimodal models have been shown to recognize a wide variety of high-level concepts from a language specification even without additional image training data, but they are often unable to distinguish classes for more fine-grained tasks. CNNs, in contrast, can extract subtle image features that are required for fine-grained discrimination, but will overfit to any bias or noise in datasets. Our insight is to use high-level language specification as advice for constraining the classification evidence to task-relevant features, instead of distractors. To do this, we ground task-relevant words or phrases with attention maps from a pretrained large-scale model. We then use this grounding to supervise a classifier's spatial attention away from distracting context. We show that supervising spatial attention in this way improves performance on classification tasks with biased and noisy data, including about 3-15% worst-group accuracy improvements and 41-45% relative improvements on fairness metrics.
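To make the core idea concrete, below is a minimal sketch (not the authors' released implementation) of the attention-supervision step described in the abstract. It assumes a `grounding` map has already been produced by grounding a task-relevant phrase with a pretrained vision-language model, and it uses a simple channel-mean activation map as a stand-in for the classifier's spatial attention; the function name `attention_loss` and both tensor layouts are illustrative assumptions.

```python
# A minimal sketch of supervising a classifier's spatial attention with a
# language-derived grounding map (illustrative, not the paper's exact code).
import torch
import torch.nn.functional as F

def attention_loss(features: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
    """Penalize classifier attention that falls outside the grounded region.

    features:  [B, C, h, w] last conv feature map of the classifier.
    grounding: [B, H, W] language-grounded map in [0, 1] (1 = task-relevant),
               assumed precomputed from a pretrained vision-language model.
    """
    # Simple stand-in for spatial attention: channel-wise mean of
    # activations, min-max normalized per image to [0, 1].
    attn = features.mean(dim=1)                                   # [B, h, w]
    attn = attn - attn.amin(dim=(1, 2), keepdim=True)
    attn = attn / (attn.amax(dim=(1, 2), keepdim=True) + 1e-8)

    # Resize the grounding map to the feature resolution.
    g = F.interpolate(grounding.unsqueeze(1), size=attn.shape[-2:],
                      mode="bilinear", align_corners=False).squeeze(1)

    # Penalize attention mass placed on non-grounded (distractor) regions.
    return (attn * (1.0 - g)).mean()

# Usage sketch: combine with the ordinary classification objective, e.g.
#   total_loss = ce_loss + lam * attention_loss(features, grounding)
# where lam trades off classification accuracy against attention supervision.
```

The key design point the sketch illustrates is that the grounding map acts as a soft spatial prior: only attention outside the language-grounded region is penalized, so the classifier remains free to learn fine-grained features within the task-relevant area.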
