Paper Title
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
Paper Authors
Paper Abstract
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between LLMs and VQA tasks. End-to-end training on vision and language data may bridge these disconnections, but is inflexible and computationally expensive. To address this issue, we propose \emph{Img2Prompt}, a plug-and-play module that provides prompts capable of bridging the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training. To provide such prompts, we further employ LLM-agnostic models to generate prompts that describe image content together with self-constructed question-answer pairs, which can effectively guide LLMs to perform zero-shot VQA. Img2Prompt offers the following benefits: 1) It can flexibly work with various LLMs to perform VQA. 2)~Without the need for end-to-end training, it significantly reduces the cost of deploying LLMs for zero-shot VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo \cite{Deepmind:Flamingo2022} by 5.6\% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20\%.
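To make the idea concrete, the following minimal sketch (our illustration, not code released with the paper) shows how image captions and self-constructed question-answer pairs could be concatenated into a purely textual prompt for a frozen LLM; the function and variable names (`build_vqa_prompt`, `captions`, `exemplar_qa`) are hypothetical.

```python
# Hypothetical sketch of the prompting idea described in the abstract: captions and
# synthetic QA pairs produced by LLM-agnostic models are concatenated into a text-only
# prompt, so a frozen LLM can answer a visual question without ever seeing the image.
# All names below are illustrative and are not taken from the paper's codebase.

def build_vqa_prompt(captions, exemplar_qa, question):
    """Assemble a text-only prompt from image captions and synthetic QA exemplars."""
    context = " ".join(captions)
    exemplars = "\n".join(f"Question: {q} Answer: {a}" for q, a in exemplar_qa)
    return (
        f"Context: {context}\n"
        f"{exemplars}\n"
        f"Question: {question} Answer:"
    )


if __name__ == "__main__":
    # Captions and QA pairs would come from off-the-shelf captioning / question-generation
    # models; here they are hard-coded for illustration.
    captions = ["A man in a red jacket is skiing down a snowy slope."]
    exemplar_qa = [
        ("What is the man wearing?", "a red jacket"),
        ("What is on the ground?", "snow"),
    ]
    prompt = build_vqa_prompt(captions, exemplar_qa, "What sport is the man doing?")
    print(prompt)  # this string would then be fed to any frozen LLM for zero-shot answering
```

Because the resulting prompt is plain text, it can be passed unchanged to different frozen LLMs, which is what makes the module plug-and-play in the sense the abstract describes.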