Paper Title

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Authors

Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, Pete Florence

Abstract

Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.
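The core mechanism the abstract describes, composing pretrained models zero-shot via multimodal-informed prompting, can be illustrated with a minimal sketch. Here a stubbed VLM ranks candidate place and object categories for an image, and its top guesses are substituted into a text prompt for a stubbed LM. Both `vlm_rank_categories` and `lm_complete` are hypothetical placeholders standing in for real model calls (e.g., CLIP-style scoring and an LM completion API), not actual library functions:

```python
def vlm_rank_categories(image_desc: str, candidates: list[str]) -> list[str]:
    # Placeholder for a VLM scoring image/text pairs (e.g., CLIP-style).
    # For this sketch, the "image" is a text description, and candidates
    # are ranked by keyword overlap with it.
    return sorted(candidates, key=lambda c: -sum(w in image_desc for w in c.split()))

def lm_complete(prompt: str) -> str:
    # Placeholder for a large language model completion call.
    return f"[LM completion for: {prompt}]"

def socratic_caption(image_desc: str, places: list[str], objects: list[str]) -> str:
    # Step 1: the VLM grounds the image in language (top place, top objects).
    top_place = vlm_rank_categories(image_desc, places)[0]
    top_objects = vlm_rank_categories(image_desc, objects)[:2]
    # Step 2: those language outputs parameterize a prompt for the LM,
    # which contributes commonsense knowledge the VLM lacks.
    prompt = (
        f"I am in a {top_place}. I see {', '.join(top_objects)}. "
        "Describe the scene in one sentence:"
    )
    return lm_complete(prompt)

caption = socratic_caption(
    image_desc="kitchen counter with a kettle and mugs",
    places=["kitchen", "office", "beach"],
    objects=["kettle", "mug", "laptop", "surfboard"],
)
print(caption)
```

The point of the pattern is that no finetuning occurs: the models exchange information purely through the language in the prompt, so swapping in a different VLM or LM requires only changing the stubbed calls.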
