Paper Title

Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions

Paper Authors

Arjun R. Akula, Spandana Gella, Yaser Al-Onaizan, Song-Chun Zhu, Siva Reddy

Paper Abstract

Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. We critically examine RefCOCOg, a standard benchmark for this task, using a human study and show that 83.7% of test instances do not require reasoning on linguistic structure, i.e., words alone are enough to identify the target object and word order does not matter. To measure the true progress of existing models, we split the test set into two sets, one which requires reasoning on linguistic structure and the other which does not. Additionally, we create an out-of-distribution dataset, Ref-Adv, by asking crowd workers to perturb in-domain examples so that the target object changes. Using these datasets, we empirically show that existing methods fail to exploit linguistic structure and are 12% to 23% lower in performance than the established progress for this task. We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task. Our datasets are publicly available at https://github.com/aws/aws-refcocog-adv.
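To make the contrastive-learning idea concrete, below is a minimal sketch, not the paper's actual implementation, of a contrastive objective for referring-expression grounding: the scorer should rank the ground-truth region higher under the original expression than under a perturbed (e.g., word-order-altered) expression whose target object differs. The function names and tensor shapes here are assumptions for illustration; a ViLBERT-style model would supply the region-expression scores.

```python
# Minimal sketch (assumed setup, not the authors' code): a hinge-style
# contrastive term combined with a standard grounding loss.
import torch
import torch.nn.functional as F

def contrastive_grounding_loss(scores_orig, scores_pert, target_idx, margin=0.5):
    """scores_orig / scores_pert: (batch, num_regions) region-expression scores
    for the original and perturbed expressions (hypothetical model outputs);
    target_idx: (batch,) index of the ground-truth region for the original."""
    # Score of the true region under the original expression.
    pos = scores_orig.gather(1, target_idx.unsqueeze(1)).squeeze(1)
    # Score of the same region under the perturbed expression; it should drop,
    # since the perturbation changes the target object.
    neg = scores_pert.gather(1, target_idx.unsqueeze(1)).squeeze(1)
    # Hinge term: the original expression must win by at least `margin`.
    contrastive = F.relu(margin - (pos - neg)).mean()
    # Standard grounding (cross-entropy) term on the original expression.
    grounding = F.cross_entropy(scores_orig, target_idx)
    return grounding + contrastive

# Toy usage with random scores standing in for a real ViLBERT forward pass.
if __name__ == "__main__":
    batch, num_regions = 4, 10
    scores_orig = torch.randn(batch, num_regions, requires_grad=True)
    scores_pert = torch.randn(batch, num_regions)
    target_idx = torch.randint(0, num_regions, (batch,))
    loss = contrastive_grounding_loss(scores_orig, scores_pert, target_idx)
    loss.backward()
    print(loss.item())
```

The multi-task variant mentioned in the abstract would instead train the grounding objective jointly with an auxiliary task; the details of both methods are in the paper and repository linked above.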
