Title
Reasoning Over History: Context Aware Visual Dialog
Authors
Abstract
While neural models have been shown to exhibit strong performance on single-turn visual question answering (VQA) tasks, extending VQA to a multi-turn, conversational setting remains a challenge. One way to address this challenge is to augment existing strong neural VQA models with mechanisms that allow them to retain information from previous dialog turns. One strong VQA model is the MAC network, which decomposes a task into a series of attention-based reasoning steps. However, since the MAC network is designed for single-turn question answering, it is not capable of referring to past dialog turns. More specifically, it struggles with tasks that require reasoning over the dialog history, particularly coreference resolution. We extend the MAC network architecture with Context-aware Attention and Memory (CAM), which attends over control states in past dialog turns to determine the necessary reasoning operations for the current question. MAC nets with CAM achieve up to 98.25% accuracy on the CLEVR-Dialog dataset, beating the existing state-of-the-art by 30% (absolute). Our error analysis indicates that with CAM, the model's performance improves particularly on questions that require coreference resolution.
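The abstract does not specify the exact form of CAM's attention over past control states. The following is a minimal sketch of one plausible form, plain dot-product attention with a softmax over past turns, where all names (`cam_attention`, `past_controls`, `current_control`) are hypothetical and chosen for illustration only:

```python
import numpy as np

def cam_attention(past_controls, current_control):
    """Sketch: attend over control states from past dialog turns.

    past_controls: (num_turns, d) array, one control state per past turn.
    current_control: (d,) control vector for the current question.
    Returns attention weights over past turns and the resulting
    context vector (a convex combination of past control states).
    """
    # Dot-product relevance of each past control state to the
    # current control vector.
    scores = past_controls @ current_control
    # Numerically stable softmax over past turns.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # Context vector: attention-weighted sum of past control states.
    context = weights @ past_controls
    return weights, context

# Toy example: 3 past turns, control states of dimension 4.
rng = np.random.default_rng(0)
past = rng.normal(size=(3, 4))
query = rng.normal(size=4)
w, ctx = cam_attention(past, query)
```

In the actual model the context vector would be combined with the current control state to steer the reasoning step; that combination (and any learned projections) is omitted here.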