Paper Title


ORD: Object Relationship Discovery for Visual Dialogue Generation

Paper Authors

Ziwei Wang, Zi Huang, Yadan Luo, Huimin Lu

Abstract


With the rapid advancement of image captioning and visual question answering at the single-round level, the question of how to generate multi-round dialogue about visual content has not yet been well explored. Existing visual dialogue methods encode the image directly into a fixed feature vector, which is concatenated with the question and history embeddings to predict the response. Some recent methods tackle the co-reference resolution problem using a co-attention mechanism to cross-refer relevant elements from the image, the history, and the target question. However, reasoning about visual relationships remains challenging, since fine-grained object-level information is omitted before co-attentive reasoning. In this paper, we propose an Object Relationship Discovery (ORD) framework to preserve object interactions for visual dialogue generation. Specifically, a hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally to obtain the final graph embeddings. Graph attention is further incorporated to dynamically attend to this graph-structured representation at the response reasoning stage. Extensive experiments show that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships. The model achieves superior performance over state-of-the-art methods on the Visual Dialog dataset, increasing MRR from 0.6222 to 0.6447 and Recall@1 from 48.48% to 51.22%.
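The pipeline the abstract describes — a local graph-convolution step over detected objects and their neighbours, a global refinement of object-object connections, and question-guided attention over the resulting graph embeddings — can be sketched roughly as follows. This is a minimal illustration, not the paper's actual HierGCN: all dimensions, weights, the mean-aggregation scheme, and the fully-connected global graph are assumptions made here for clarity.

```python
import numpy as np

def gcn_layer(node_feats, adj, weight):
    """One graph-convolution step: mean-aggregate neighbour features
    (with self-loops added) and apply a linear projection + ReLU."""
    adj_hat = adj + np.eye(adj.shape[0])            # add self-loops
    deg = adj_hat.sum(axis=1, keepdims=True)        # node degrees (>= 1)
    msg = (adj_hat / deg) @ node_feats              # average over neighbours
    return np.maximum(msg @ weight, 0.0)            # project + ReLU

def graph_attention_pool(node_feats, query):
    """Attend over node embeddings with a query vector (e.g. the encoded
    question) and return the attention-weighted graph summary."""
    scores = node_feats @ query                     # relevance of each node
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax weights
    return alpha @ node_feats                       # weighted sum of nodes

rng = np.random.default_rng(0)
n_objects, d = 5, 8                                 # illustrative sizes
feats = rng.normal(size=(n_objects, d))             # detected object features

# Local stage: sparse relations between nearby objects (random here).
adj = (rng.random((n_objects, n_objects)) > 0.5).astype(float)
adj = np.maximum(adj, adj.T)                        # symmetrise
local = gcn_layer(feats, adj, rng.normal(size=(d, d)))

# Global stage: refine object-object connections over all pairs.
global_adj = np.ones((n_objects, n_objects))
graph_emb = gcn_layer(local, global_adj, rng.normal(size=(d, d)))

# Response reasoning: attend to the graph with the question encoding.
question = rng.normal(size=d)
context = graph_attention_pool(graph_emb, question)
print(context.shape)                                # (8,)
```

In this toy version the "hierarchy" is simply two stacked layers with different adjacency structures (sparse local, dense global); the question-conditioned pooling stands in for the graph attention used at the response reasoning stage.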
