Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue

Paper Title

Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue

Authors

Zipeng Xu, Fangxiang Feng, Xiaojie Wang, Yushu Yang, Huixing Jiang, Zhongyuan Wang

Abstract

A goal-oriented visual dialogue involves multi-turn interactions between two agents, a Questioner and an Oracle. During the dialogue, the answer given by the Oracle is of great significance, as it provides a golden response to what the Questioner is asking about. Based on the answer, the Questioner updates its belief about the target visual content and then raises another question. Notably, different answers lead to different visual beliefs and future questions. However, existing methods always indiscriminately encode answers after much longer questions, resulting in weak utilization of the answers. In this paper, we propose an Answer-Driven Visual State Estimator (ADVSE) to impose the effects of different answers on visual states. First, we propose Answer-Driven Focusing Attention (ADFA) to capture the answer-driven effect on visual attention by sharpening question-related attention and adjusting it with an answer-based logical operation at each turn. Then, based on the focusing attention, we obtain the visual state estimation through Conditional Visual Information Fusion (CVIF), where overall information and difference information are fused conditioned on the question-answer state. We evaluate the proposed ADVSE on both the question generator and guesser tasks on the large-scale GuessWhat?! dataset and achieve state-of-the-art performance on both tasks. The qualitative results indicate that ADVSE helps the agent generate highly efficient questions and obtain reliable visual attention during reasonable question generation and guessing processes.
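The abstract names the two steps of the focusing attention (sharpening, then an answer-based logical adjustment at each turn) but gives no formulas. The toy sketch below only illustrates that general idea and is not the paper's formulation. Everything in it is an assumption made for illustration: the temperature softmax used for sharpening, the keep/invert/flatten rule for the answers "yes", "no", and "n/a", the multiplicative accumulation across turns, and all function names.

```python
import math


def normalize(weights):
    """Scale weights to sum to 1; fall back to uniform if all mass vanished."""
    total = sum(weights)
    if total <= 0.0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def sharpen(scores, temperature=0.5):
    """Assumed sharpening step: softmax with temperature < 1 concentrates mass."""
    top = max(scores)
    return normalize([math.exp((s - top) / temperature) for s in scores])


def apply_answer(attention, answer):
    """Assumed logical adjustment: keep on 'yes', invert on 'no', flatten otherwise."""
    if answer == "yes":
        adjusted = list(attention)
    elif answer == "no":
        adjusted = [1.0 - a for a in attention]
    else:  # e.g. "n/a": the question carries no usable evidence
        adjusted = [1.0] * len(attention)
    return normalize(adjusted)


def update_state(prev_state, question_scores, answer):
    """Assumed accumulation: multiply the running state by this turn's attention."""
    turn_attention = apply_answer(sharpen(question_scores), answer)
    return normalize([p * t for p, t in zip(prev_state, turn_attention)])


# Three image regions; the question attends mostly to region 0.
uniform = [1.0 / 3] * 3
state_no = update_state(uniform, [2.0, 0.1, 0.1], "no")
state_yes = update_state(uniform, [2.0, 0.1, 0.1], "yes")
print(state_no)
print(state_yes)
```

In this toy setup a "no" answer moves attention away from the region the question referred to, while a "yes" answer concentrates attention on it, which is the kind of answer-dependent behavior the abstract describes.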
