Paper Title
Multi-View Learning for Vision-and-Language Navigation
Paper Authors
Paper Abstract
Learning to navigate a visual environment by following natural language instructions is a challenging task because the instructions are highly variable, ambiguous, and under-specified. In this paper, we present a novel training paradigm, Learn from EveryOne (LEO), which leverages multiple instructions (as different views) for the same trajectory to resolve language ambiguity and improve generalization. By sharing parameters across instructions, our approach learns more effectively from limited training data and generalizes better to unseen environments. On the recent Room-to-Room (R2R) benchmark dataset, LEO achieves a 16% absolute improvement over a greedy base agent (25.3% $\rightarrow$ 41.4%) in Success weighted by Path Length (SPL). Further, LEO is complementary to most existing vision-and-language navigation models and integrates easily with existing techniques, yielding LEO+, which sets a new state of the art by pushing the R2R benchmark to 62% (a 9% absolute improvement).