Paper Title

LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering

Paper Authors

Weixin Liang, Feiyang Niu, Aishwarya Reganti, Govind Thattai, Gokhan Tur

Paper Abstract

The predominant approach to visual question answering (VQA) relies on encoding the image and question with a "black-box" neural encoder and decoding a single token, such as "yes" or "no", as the answer. Despite this approach's strong quantitative results, it struggles to provide an intuitive, human-readable justification for its prediction process. To address this shortcoming, we reformulate VQA as a full answer generation task, which requires the model to justify its predictions in natural language. We propose LRTA [Look, Read, Think, Answer], a transparent neural-symbolic reasoning framework for visual question answering that solves the problem step by step like humans and provides a human-readable justification at each step. Specifically, LRTA learns to first convert an image into a scene graph and parse a question into multiple reasoning instructions. It then executes the reasoning instructions one at a time by traversing the scene graph using a recurrent neural-symbolic execution module. Finally, it generates a full answer to the given question with natural language justifications. Our experiments on the GQA dataset show that LRTA outperforms the state-of-the-art model by a large margin (43.1% vs. 28.0%) on the full answer generation task. We also create a perturbed GQA test set by removing linguistic cues (attributes and relations) from the questions, in order to analyze whether a model is merely guessing based on superficial data correlations. We show that LRTA makes a step towards truly understanding the question, while the state-of-the-art model tends to learn superficial correlations from the training data.

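To make the Look-Read-Think-Answer pipeline described in the abstract concrete, here is a minimal sketch in Python. All class and function names, the scene-graph format, and the instruction set (`select`, `relate`, `query`) are hypothetical simplifications introduced for illustration, not the authors' actual implementation; in LRTA each stage would be a learned neural or neural-symbolic module rather than the hard-coded stubs shown here.

```python
# Minimal illustrative sketch of a Look-Read-Think-Answer style pipeline.
# All names and formats below are hypothetical placeholders; the real LRTA
# stages are learned models (scene-graph generator, instruction parser,
# recurrent neural-symbolic executor, full-answer generator).

from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class SceneGraph:
    """Toy scene graph: objects with attribute sets, plus relation triples."""
    objects: Dict[str, Set[str]]            # object name -> attributes
    relations: List[Tuple[str, str, str]]   # (subject, relation, object)


def look(image) -> SceneGraph:
    """'Look': convert the image into a scene graph (stubbed with a fixed graph)."""
    return SceneGraph(
        objects={"apple": {"red"}, "table": {"wooden"}},
        relations=[("apple", "on", "table")],
    )


def read(question: str) -> List[Tuple[str, str]]:
    """'Read': parse the question into a sequence of reasoning instructions (stubbed)."""
    # e.g. "What color is the apple on the table?"
    return [("select", "table"), ("relate", "on"), ("query", "color")]


def think(graph: SceneGraph, instructions: List[Tuple[str, str]]) -> str:
    """'Think': execute instructions one at a time by traversing the scene graph."""
    attended: Set[str] = set()
    for op, arg in instructions:
        if op == "select":
            # Attend to objects matching the argument.
            attended = {name for name in graph.objects if name == arg}
        elif op == "relate":
            # Move attention along relation edges ending at the attended objects.
            attended = {s for (s, rel, o) in graph.relations
                        if rel == arg and o in attended}
        elif op == "query":
            # Return an attribute of the attended object (attributes are untyped here).
            obj = next(iter(attended))
            return next(iter(graph.objects[obj]))
    return "unknown"


def answer(question: str, result: str) -> str:
    """'Answer': wrap the execution result in a full natural-language answer (stubbed)."""
    return f"The apple is {result}."


if __name__ == "__main__":
    question = "What color is the apple on the table?"
    graph = look(image=None)
    steps = read(question)
    result = think(graph, steps)
    print(answer(question, result))   # -> "The apple is red."
```

Because each intermediate product (the scene graph, the instruction list, and the per-step execution state) is explicit, every stage of the prediction can be inspected and justified, which is the transparency property the abstract emphasizes.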