Paper Title
Declaration-based Prompt Tuning for Visual Question Answering
Paper Authors
Paper Abstract
In recent years, the pre-training-then-fine-tuning paradigm has yielded immense success on a wide spectrum of cross-modal tasks, such as visual question answering (VQA), in which a visual-language (VL) model is first optimized via self-supervised task objectives, e.g., masked language modeling (MLM) and image-text matching (ITM), and then fine-tuned to adapt to the downstream task (e.g., VQA) via a brand-new objective function, e.g., answer prediction. The inconsistency of the objective forms not only severely limits the generalization of pre-trained VL models to downstream tasks, but also requires a large amount of labeled data for fine-tuning. To alleviate this problem, we propose an innovative VL fine-tuning paradigm (named Declaration-based Prompt Tuning, abbreviated as DPT), which jointly optimizes the objectives of pre-training and fine-tuning of the VQA model, boosting the effective adaptation of pre-trained VL models to the downstream task. Specifically, DPT reformulates the objective form of the VQA task via (1) textual adaptation, which converts the given question into a declarative sentence form for prompt tuning, and (2) task adaptation, which optimizes the objective function of the VQA problem in the manner of the pre-training phase. Experimental results on the GQA dataset show that DPT outperforms the fine-tuned counterpart by a large margin in terms of accuracy in both fully-supervised (2.68%) and zero-shot/few-shot (over 31%) settings. All data and code will be made available to facilitate future research.
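To make the reformulation described in the abstract concrete, below is a minimal, illustrative sketch (not the authors' code) of the two adaptations: the question is rewritten as a declarative prompt with a [MASK] slot (textual adaptation), and candidate answers are scored at that slot with a masked-language-modeling head rather than a new answer-classification head (task adaptation). As an assumption for brevity, a text-only BERT MLM stands in for the pre-trained VL model, the image features the real method conditions on are omitted, and the question-to-declaration rewrite is hard-coded rather than produced by the paper's converter.

```python
# Hedged sketch of declaration-based prompt tuning for VQA (assumptions noted above).
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# (1) Textual adaptation: the interrogative question becomes a declarative
#     prompt whose answer position is the [MASK] token.
question = "What color is the man's shirt?"          # original VQA question
declaration = "The color of the man's shirt is [MASK]."

# (2) Task adaptation: answer prediction is cast as masked language modeling,
#     i.e., candidate answers are scored at the [MASK] position instead of
#     being classified by a brand-new answer head.
candidates = ["red", "blue", "green", "white"]        # toy answer vocabulary
inputs = tokenizer(declaration, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    mask_logits = model(**inputs).logits[0, mask_pos]  # [1, vocab_size] scores

cand_ids = [tokenizer.convert_tokens_to_ids(c) for c in candidates]
scores = mask_logits[0, cand_ids]
print(candidates[int(scores.argmax())])                # highest-scoring answer
```

In the paper's setting, the same idea is applied on top of a pre-trained VL model that also consumes image region features, so the [MASK] prediction is grounded in the image rather than in text alone; the sketch only shows how the VQA objective is aligned with the MLM pre-training objective.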