Paper Title
From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
Paper Authors
Paper Abstract
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between LLMs and VQA tasks. End-to-end training on vision and language data may bridge these disconnections, but is inflexible and computationally expensive. To address this issue, we propose \emph{Img2Prompt}, a plug-and-play module that provides prompts capable of bridging the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training. To provide such prompts, we further employ LLM-agnostic models to generate prompts that describe image content together with self-constructed question-answer pairs, which can effectively guide LLMs to perform zero-shot VQA. Img2Prompt offers the following benefits: 1) It can flexibly work with various LLMs to perform VQA. 2)~Without the need for end-to-end training, it significantly reduces the cost of deploying LLMs for zero-shot VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo \cite{Deepmind:Flamingo2022} by 5.6\% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20\%.
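To make the idea concrete, the following minimal sketch (our illustration, not code released with the paper) shows how image captions and self-constructed question-answer pairs could be concatenated into a purely textual prompt for a frozen LLM; the function and variable names (`build_vqa_prompt`, `captions`, `exemplar_qa`) are hypothetical.

```python
# Hypothetical sketch of the prompting idea described in the abstract: captions and
# synthetic QA pairs produced by LLM-agnostic models are concatenated into a text-only
# prompt, so a frozen LLM can answer a visual question without ever seeing the image.
# All names below are illustrative and are not taken from the paper's codebase.

def build_vqa_prompt(captions, exemplar_qa, question):
    """Assemble a text-only prompt from image captions and synthetic QA exemplars."""
    context = " ".join(captions)
    exemplars = "\n".join(f"Question: {q} Answer: {a}" for q, a in exemplar_qa)
    return (
        f"Context: {context}\n"
        f"{exemplars}\n"
        f"Question: {question} Answer:"
    )


if __name__ == "__main__":
    # Captions and QA pairs would come from off-the-shelf captioning / question-generation
    # models; here they are hard-coded for illustration.
    captions = ["A man in a red jacket is skiing down a snowy slope."]
    exemplar_qa = [
        ("What is the man wearing?", "a red jacket"),
        ("What is on the ground?", "snow"),
    ]
    prompt = build_vqa_prompt(captions, exemplar_qa, "What sport is the man doing?")
    print(prompt)  # this string would then be fed to any frozen LLM for zero-shot answering
```

Because the resulting prompt is plain text, it can be passed unchanged to different frozen LLMs, which is what makes the module plug-and-play in the sense the abstract describes.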