The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

Paper Title

The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

Authors

Jack Hessel, Jena D. Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, Yejin Choi

Abstract

Humans have remarkable capacity to reason abductively and hypothesize about what lies beyond the literal content of an image. By identifying concrete visual clues scattered throughout a scene, we almost can't help but draw probable inferences beyond the literal scene based on our everyday experience and knowledge about the world. For example, if we see a "20 mph" sign alongside a road, we might assume the street sits in a residential area (rather than on a highway), even if no houses are pictured. Can machines perform similar visual reasoning? We present Sherlock, an annotated corpus of 103K images for testing machine capacity for abductive reasoning beyond literal image contents. We adopt a free-viewing paradigm: participants first observe and identify salient clues within images (e.g., objects, actions) and then provide a plausible inference about the scene, given the clue. In total, we collect 363K (clue, inference) pairs, which form a first-of-its-kind abductive visual reasoning dataset. Using our corpus, we test three complementary axes of abductive reasoning. We evaluate the capacity of models to: i) retrieve relevant inferences from a large candidate corpus; ii) localize evidence for inferences via bounding boxes, and iii) compare plausible inferences to match human judgments on a newly-collected diagnostic corpus of 19K Likert-scale judgments. While we find that fine-tuning CLIP-RN50x64 with a multitask objective outperforms strong baselines, significant headroom exists between model performance and human agreement. Data, models, and leaderboard available at http://visualabduction.com/
