Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task

Paper Title

Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task

Authors

Lei, Stan Weixian; Gao, Difei; Wu, Jay Zhangjie; Wang, Yuxuan; Liu, Wei; Zhang, Mengmi; Shou, Mike Zheng

Abstract

VQA is an ambitious task aiming to answer any image-related question. In reality, however, it is hard to build such a system once and for all, since the needs of users are continuously updated and the system has to implement new functions. Thus, Continual Learning (CL) ability is a must in developing advanced VQA systems. Recently, a pioneering work split a VQA dataset into disjoint answer sets to study this topic. However, CL on VQA involves more than the expansion of label sets (new Answer sets). It is also crucial to study how to answer questions when deploying VQA systems to new environments (new Visual scenes) and how to answer questions requiring new functions (new Question types). Thus, we propose CLOVE, a benchmark for Continual Learning On Visual quEstion answering, which contains scene- and function-incremental settings for the two aforementioned CL scenarios. In terms of methodology, the main difference between CL on VQA and CL on classification is that the former additionally involves expanding reasoning mechanisms and preventing them from being forgotten, while the latter focuses on class representation. Thus, we propose a real-data-free replay-based method tailored for CL on VQA, named Scene Graph as Prompt for Symbolic Replay. Using a piece of scene graph as a prompt, it replays pseudo scene graphs to represent the past images, along with correlated QA pairs. A unified VQA model is also proposed to utilize the current and replayed data to enhance its QA ability. Finally, experimental results reveal challenges in CLOVE and demonstrate the effectiveness of our method. The dataset and code will be available at https://github.com/showlab/CLVQA.
