Paper Title
UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog
Paper Authors
Paper Abstract
Visual Dialog aims to answer multi-round, interactive questions based on the dialog history and image content. Existing methods either treat answer ranking and answer generation individually, or only weakly capture the relation between the two tasks implicitly via two separate models. Research on a universal framework that jointly learns to rank and generate answers in a single model is seldom explored. In this paper, we propose a contrastive learning-based framework, UTC, to unify and facilitate both the discriminative and generative tasks in visual dialog with a single model. Specifically, considering the inherent limitations of previous learning paradigms, we devise two inter-task contrastive losses, i.e., a context contrastive loss and an answer contrastive loss, to make the discriminative and generative tasks mutually reinforce each other. These two complementary contrastive losses exploit the dialog context and the target answer as anchor points to provide representation learning signals from different perspectives. We evaluate the proposed UTC on the VisDial v1.0 dataset, where our method outperforms the state of the art on both discriminative and generative tasks and surpasses previous state-of-the-art generative methods by more than 2 absolute points on Recall@1.
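The abstract describes the two inter-task losses only at a high level. As a rough illustration, anchor-based contrastive losses of this kind typically follow an InfoNCE-style pattern, pulling an anchor embedding (e.g., the dialog context) toward a matching positive (e.g., the target answer) and away from negatives. The sketch below is a minimal, hypothetical example in that spirit, not the paper's exact formulation; the function name info_nce, the tensor shapes, and the temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             negatives: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive loss.

    anchor:    (B, D)    e.g., dialog-context embeddings
    positive:  (B, D)    e.g., target-answer embeddings
    negatives: (B, K, D) e.g., K non-matching answer embeddings per anchor
    """
    # Work with unit-norm embeddings so dot products are cosine similarities.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    # Similarity between each anchor and its positive: (B, 1)
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)
    # Similarities between each anchor and its K negatives: (B, K)
    neg_sim = torch.einsum('bd,bkd->bk', anchor, negatives)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    # The positive sits at index 0 of each row of logits.
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)

# Example: a batch of 4 anchors, each with 1 positive and 10 negatives in a 256-d space.
loss = info_nce(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 10, 256))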