Paper Title

LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering

Paper Authors

Weixin Liang, Feiyang Niu, Aishwarya Reganti, Govind Thattai, Gokhan Tur

Paper Abstract

The predominant approach to visual question answering (VQA) relies on encoding the image and question with a "black-box" neural encoder and decoding a single token, such as "yes" or "no", as the answer. Despite this approach's strong quantitative results, it struggles to provide an intuitive, human-readable justification for its prediction process. To address this shortcoming, we reformulate VQA as a full answer generation task, which requires the model to justify its predictions in natural language. We propose LRTA [Look, Read, Think, Answer], a transparent neural-symbolic reasoning framework for visual question answering that solves the problem step by step like humans and provides a human-readable justification at each step. Specifically, LRTA learns to first convert an image into a scene graph and parse a question into multiple reasoning instructions. It then executes the reasoning instructions one at a time by traversing the scene graph using a recurrent neural-symbolic execution module. Finally, it generates a full answer to the given question with natural language justifications. Our experiments on the GQA dataset show that LRTA outperforms the state-of-the-art model by a large margin (43.1% vs. 28.0%) on the full answer generation task. We also create a perturbed GQA test set by removing linguistic cues (attributes and relations) from the questions, in order to analyze whether a model is merely guessing based on superficial data correlations. We show that LRTA makes a step towards truly understanding the question, while the state-of-the-art model tends to learn superficial correlations from the training data.

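To make the Look-Read-Think-Answer pipeline described in the abstract concrete, here is a minimal sketch in Python. All class and function names, the scene-graph format, and the instruction set (`select`, `relate`, `query`) are hypothetical simplifications introduced for illustration, not the authors' actual implementation; in LRTA each stage would be a learned neural or neural-symbolic module rather than the hard-coded stubs shown here.

```python
# Minimal illustrative sketch of a Look-Read-Think-Answer style pipeline.
# All names and formats below are hypothetical placeholders; the real LRTA
# stages are learned models (scene-graph generator, instruction parser,
# recurrent neural-symbolic executor, full-answer generator).

from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class SceneGraph:
    """Toy scene graph: objects with attribute sets, plus relation triples."""
    objects: Dict[str, Set[str]]            # object name -> attributes
    relations: List[Tuple[str, str, str]]   # (subject, relation, object)


def look(image) -> SceneGraph:
    """'Look': convert the image into a scene graph (stubbed with a fixed graph)."""
    return SceneGraph(
        objects={"apple": {"red"}, "table": {"wooden"}},
        relations=[("apple", "on", "table")],
    )


def read(question: str) -> List[Tuple[str, str]]:
    """'Read': parse the question into a sequence of reasoning instructions (stubbed)."""
    # e.g. "What color is the apple on the table?"
    return [("select", "table"), ("relate", "on"), ("query", "color")]


def think(graph: SceneGraph, instructions: List[Tuple[str, str]]) -> str:
    """'Think': execute instructions one at a time by traversing the scene graph."""
    attended: Set[str] = set()
    for op, arg in instructions:
        if op == "select":
            # Attend to objects matching the argument.
            attended = {name for name in graph.objects if name == arg}
        elif op == "relate":
            # Move attention along relation edges ending at the attended objects.
            attended = {s for (s, rel, o) in graph.relations
                        if rel == arg and o in attended}
        elif op == "query":
            # Return an attribute of the attended object (attributes are untyped here).
            obj = next(iter(attended))
            return next(iter(graph.objects[obj]))
    return "unknown"


def answer(question: str, result: str) -> str:
    """'Answer': wrap the execution result in a full natural-language answer (stubbed)."""
    return f"The apple is {result}."


if __name__ == "__main__":
    question = "What color is the apple on the table?"
    graph = look(image=None)
    steps = read(question)
    result = think(graph, steps)
    print(answer(question, result))   # -> "The apple is red."
```

Because each intermediate product (the scene graph, the instruction list, and the per-step execution state) is explicit, every stage of the prediction can be inspected and justified, which is the transparency property the abstract emphasizes.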