Paper Title
Linguistically Driven Graph Capsule Network for Visual Question Reasoning
Paper Authors
Paper Abstract
Recently, studies of visual question answering have explored various architectures of end-to-end networks and achieved promising results on both natural and synthetic datasets, the latter of which require explicitly compositional reasoning. However, it has been argued that these black-box approaches lack interpretability of results and thus cannot perform well on generalization tasks, due to overfitting to dataset bias. In this work, we aim to combine the benefits of both sides and overcome their limitations to achieve end-to-end interpretable structural reasoning for general images without requiring layout annotations. Inspired by the property of a capsule network that it can carve a tree structure inside a regular convolutional neural network (CNN), we propose a hierarchical compositional reasoning model called the "Linguistically Driven Graph Capsule Network", in which the compositional process is guided by the linguistic parse tree. Specifically, we bind each capsule in the lowest layer to the linguistic embedding of a single word in the original question, bridged with visual evidence, and then route capsules to the same parent capsule if their words are siblings in the parse tree. This compositional process is achieved by performing inference on a linguistically driven conditional random field (CRF) and is repeated across multiple graph capsule layers, resulting in a compositional reasoning process inside a CNN. Experiments on the CLEVR dataset, the CLEVR compositional generalization test (CoGenT), and the FigureQA dataset demonstrate the effectiveness and compositional generalization ability of our end-to-end model.
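To make the routing idea concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code): each lowest-layer capsule fuses one word embedding with pooled visual evidence, and capsules whose words share a parent in the parse tree are pooled into one higher-level capsule. The paper infers soft assignments via a linguistically driven CRF; the hard mean-pooling here, along with all names and dimensions (WordCapsuleLayer, route_by_parse_tree, WORD_DIM, VIS_DIM, CAP_DIM), are illustrative assumptions chosen to keep the sketch short.

```python
# Conceptual sketch of linguistically driven capsule routing (assumptions noted above).
import torch
import torch.nn as nn

WORD_DIM, VIS_DIM, CAP_DIM = 300, 512, 128  # hypothetical dimensions

class WordCapsuleLayer(nn.Module):
    """Binds each word embedding to a capsule grounded in visual evidence."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Linear(WORD_DIM + VIS_DIM, CAP_DIM)

    def forward(self, word_emb, vis_feat):
        # word_emb: (num_words, WORD_DIM); vis_feat: (VIS_DIM,) pooled image feature
        vis = vis_feat.unsqueeze(0).expand(word_emb.size(0), -1)
        return torch.tanh(self.fuse(torch.cat([word_emb, vis], dim=-1)))

def route_by_parse_tree(capsules, parent_of, num_parents):
    """Hard routing driven by the parse tree: capsules whose words share a
    parent are averaged into the same higher-level capsule. The paper instead
    infers soft assignments with a linguistically driven CRF; mean pooling is
    used here only for brevity."""
    parents = torch.zeros(num_parents, capsules.size(-1))
    counts = torch.zeros(num_parents, 1)
    for i, p in enumerate(parent_of):
        parents[p] += capsules[i]
        counts[p] += 1
    return parents / counts.clamp(min=1)

# Toy usage: question fragment "large red sphere", where "large" and "red"
# modify "sphere", so all three words share one parent node in the parse tree.
words = torch.randn(3, WORD_DIM)             # stand-in word embeddings
image = torch.randn(VIS_DIM)                 # stand-in pooled CNN feature
caps = WordCapsuleLayer()(words, image)      # (3, CAP_DIM) word-level capsules
parent_of = [0, 0, 0]                        # parse-tree parent index per word
higher = route_by_parse_tree(caps, parent_of, num_parents=1)  # (1, CAP_DIM)
```

Stacking such layers and following the parse tree upward yields the hierarchical composition the abstract describes, with the question's root node producing the final answer representation.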