Paper Title

Measuring Compositional Consistency for Video Question Answering

Authors

Mona Gandhi, Mustafa Omer Gul, Eva Prakash, Madeleine Grunde-McLaughlin, Ranjay Krishna, Maneesh Agrawala

Abstract

Recent video question answering benchmarks indicate that state-of-the-art models struggle to answer compositional questions. However, it remains unclear which types of compositional reasoning cause models to mispredict. Furthermore, it is difficult to discern whether models arrive at answers using compositional reasoning or by leveraging data biases. In this paper, we develop a question decomposition engine that programmatically deconstructs a compositional question into a directed acyclic graph of sub-questions. The graph is designed such that each parent question is a composition of its children. We present AGQA-Decomp, a benchmark containing 2.3M question graphs, with an average of 11.49 sub-questions per graph, and 4.55M total new sub-questions. Using question graphs, we evaluate three state-of-the-art models with a suite of novel compositional consistency metrics. We find that models either cannot reason correctly through most compositions or are reliant on incorrect reasoning to reach answers, frequently contradicting themselves or achieving high accuracies when failing at intermediate reasoning steps.
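To make the parent-child question structure concrete, here is a minimal Python sketch of a question DAG and one consistency check in the spirit of the metrics the abstract describes: flagging a parent question the model answers correctly while failing one of its sub-questions. The names (`QuestionNode`, `consistency_violations`) and the exact-match correctness test are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class QuestionNode:
    """One node in a question DAG: a (sub-)question plus model output.

    Illustrative structure only; the paper's engine generates these
    graphs programmatically from compositional questions.
    """
    question: str
    ground_truth: str
    prediction: str
    children: list["QuestionNode"] = field(default_factory=list)

    @property
    def correct(self) -> bool:
        return self.prediction == self.ground_truth

def consistency_violations(root: QuestionNode) -> list[QuestionNode]:
    """Collect parents answered correctly while some child is answered wrong.

    Such cases suggest the model reached the parent's answer without
    correctly reasoning through the intermediate (child) steps.
    """
    violations: list[QuestionNode] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.correct and any(not c.correct for c in node.children):
            violations.append(node)
        stack.extend(node.children)
    return violations

# Toy example: a compositional question built from two sub-questions.
child1 = QuestionNode("Does the person hold a phone?", "yes", "no")
child2 = QuestionNode("Does the person sit down?", "yes", "yes")
parent = QuestionNode(
    "Does the person hold a phone before sitting down?",
    "yes", "yes", children=[child1, child2],
)
print([n.question for n in consistency_violations(parent)])
# -> ['Does the person hold a phone before sitting down?']
```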
