Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

Paper Title

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

Authors

Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

Abstract

Following language instructions to navigate in unseen environments is a challenging problem for autonomous embodied agents. The agent not only needs to ground languages in visual scenes, but also should explore the environment to reach its target. In this work, we propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding. We build a topological map on-the-fly to enable efficient exploration in global action space. To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers. The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation (VLN) benchmarks REVERIE and SOON. It also improves the success rate on the fine-grained VLN benchmark R2R.
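
To make the two ideas in the abstract concrete, the following is a minimal sketch of a topological map that grows as the agent moves, plus a gated blend of coarse (global) and fine (local) action scores. It is an illustration under stated assumptions, not the authors' implementation: the names `TopoMap`, `update`, `global_actions`, and `fuse` are hypothetical, the gate is a fixed scalar rather than a learned, dynamically predicted weight, and the graph transformer encoders that would produce the scores are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class TopoMap:
    """Hypothetical on-the-fly topological map (illustrative only)."""
    edges: dict = field(default_factory=dict)   # node -> set of neighbours
    visited: set = field(default_factory=set)   # nodes the agent has stood on

    def update(self, current, navigable):
        """Record a visit to `current` and the neighbours observed from it."""
        self.visited.add(current)
        self.edges.setdefault(current, set())
        for n in navigable:
            self.edges[current].add(n)
            self.edges.setdefault(n, set()).add(current)

    def global_actions(self):
        """Every observed-but-unvisited node anywhere on the map."""
        return sorted(n for n in self.edges if n not in self.visited)


def fuse(global_scores, local_scores, gate):
    """Blend coarse and fine scores with a gate in [0, 1].

    Assumption: nodes with no local score (not adjacent to the agent)
    contribute 0.0 on the fine-scale side.
    """
    return {
        node: gate * g + (1.0 - gate) * local_scores.get(node, 0.0)
        for node, g in global_scores.items()
    }


topo = TopoMap()
topo.update("a", ["b", "c"])   # agent starts at "a" and sees "b" and "c"
topo.update("b", ["a", "d"])   # agent moves to "b" and sees "d"
candidates = topo.global_actions()   # ['c', 'd']
fused = fuse({"c": 0.2, "d": 0.8}, {"d": 0.4}, gate=0.5)
best = max(fused, key=fused.get)     # 'd'
```

In this sketch the global action space includes "c" even though the agent has moved away from it, which is what lets a planner backtrack to an earlier frontier instead of choosing only among immediate neighbours.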
