Paper Title

Visually-Grounded Planning without Vision: Language Models Infer Detailed Plans from High-level Instructions

Authors

Jansen, Peter A.

Abstract

The recently proposed ALFRED challenge task aims for a virtual robotic agent to complete complex multi-step everyday tasks in a virtual home environment from high-level natural language directives, such as "put a hot piece of bread on a plate". Currently, the best-performing models are able to complete less than 5% of these tasks successfully. In this work we focus on modeling the translation problem of converting natural language directives into detailed multi-step sequences of actions that accomplish those goals in the virtual environment. We empirically demonstrate that it is possible to generate gold multi-step plans from language directives alone without any visual input in 26% of unseen cases. When a small amount of visual information is incorporated, namely the starting location in the virtual environment, our best-performing GPT-2 model successfully generates gold command sequences in 58% of cases. Our results suggest that contextualized language models may provide strong visual semantic planning modules for grounded virtual agents.
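To make the abstract's setup concrete, the sketch below shows one plausible way a fine-tuned GPT-2 could map a high-level ALFRED directive, optionally augmented with the agent's starting location, to a step-by-step command sequence. This is not the paper's released code: the prompt format, separator markers, location encoding, and example output are all illustrative assumptions.

```python
# Minimal sketch (assumed prompt format, not the authors' implementation) of using
# a GPT-2 language model to generate a multi-step plan from a natural language
# directive plus a small amount of visual context (the starting location).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # in practice, a checkpoint fine-tuned on ALFRED plans

directive = "put a hot piece of bread on a plate"
start_location = "countertop"  # hypothetical encoding of the agent's starting location

# Hypothetical prompt layout: directive and location, then a marker after which
# the model is expected to emit the command sequence it was fine-tuned to produce.
prompt = f"<directive> {directive} <location> {start_location} <plan>"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,                        # greedy decoding for a deterministic plan
    pad_token_id=tokenizer.eos_token_id,    # GPT-2 has no pad token; reuse EOS to silence warnings
)

# Decode only the newly generated tokens (the plan), not the prompt itself.
plan = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(plan)  # e.g. "goto toaster, pickup bread, heat bread, goto plate, put bread on plate" (illustrative)
```

Dropping the `<location>` field from the prompt corresponds to the language-only condition the abstract reports (26% gold plans on unseen cases), while including it corresponds to the 58% condition.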
