Paper Title
World Model as a Graph: Learning Latent Landmarks for Planning
Paper Authors
Paper Abstract
Planning - the ability to analyze the structure of a problem in the large and decompose it into interrelated subproblems - is a hallmark of human intelligence. While deep reinforcement learning (RL) has shown great promise for solving relatively straightforward control tasks, it remains an open problem how best to incorporate planning into existing deep RL paradigms to handle increasingly complex environments. One prominent framework, model-based RL, learns a world model and plans using step-by-step virtual rollouts. This type of world model quickly diverges from reality as the planning horizon increases, and thus struggles with long-horizon planning. How can we learn world models that endow agents with the ability to do temporally extended reasoning? In this work, we propose to learn graph-structured world models composed of sparse, multi-step transitions. We devise a novel algorithm to learn latent landmarks that are scattered (in terms of reachability) across the goal space as the nodes on the graph. In this same graph, the edges are reachability estimates distilled from Q-functions. On a variety of high-dimensional continuous control tasks ranging from robotic manipulation to navigation, we demonstrate that our method, named L3P, significantly outperforms prior work and is oftentimes the only method capable of leveraging both the robustness of model-free RL and the generalization of graph-search algorithms. We believe our work is an important step towards scalable planning in reinforcement learning.
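To make the planning loop described above concrete, here is a minimal sketch (not the authors' implementation) of graph search over latent landmarks. It assumes a learned reachability estimate dist(a, b), e.g., distilled from a goal-conditioned Q-function under a -1-per-step reward so that cost is roughly -Q(a, b), and a set of landmark embeddings already learned in goal space. The names plan_next_subgoal, dist, landmarks, and max_edge are illustrative, not from the paper.

```python
import heapq
import numpy as np

def plan_next_subgoal(state, goal, landmarks, dist, max_edge=15.0):
    """Pick the next subgoal via shortest-path search over latent landmarks.

    state, goal : arrays in goal space
    landmarks   : (K, d) array of learned landmark embeddings (hypothetical)
    dist(a, b)  : learned reachability estimate (e.g., -Q(a, b) under a
                  -1-per-step goal-conditioned reward); smaller means closer
    max_edge    : prune edges costlier than this, since long-range value
                  estimates tend to be unreliable
    """
    nodes = [state] + list(landmarks) + [goal]
    n = len(nodes)
    START, GOAL = 0, n - 1

    # Build a directed graph; keep edge (i, j) only if j looks reachable from i.
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                c = dist(nodes[i], nodes[j])
                if c <= max_edge:
                    adj[i].append((j, c))

    # Dijkstra from the current state to the goal.
    best = [np.inf] * n
    prev = [-1] * n
    best[START] = 0.0
    pq = [(0.0, START)]
    while pq:
        c, u = heapq.heappop(pq)
        if c > best[u]:
            continue
        if u == GOAL:
            break
        for v, w in adj[u]:
            if c + w < best[v]:
                best[v] = c + w
                prev[v] = u
                heapq.heappush(pq, (c + w, v))

    if not np.isfinite(best[GOAL]):
        return goal  # no path found: fall back to pursuing the goal directly

    # Walk back from the goal to the first waypoint after the start.
    node = GOAL
    while prev[node] != START:
        node = prev[node]
    return nodes[node]

# Toy usage, with Euclidean distance standing in for the learned estimate:
rng = np.random.default_rng(0)
landmarks = rng.uniform(-10, 10, size=(20, 2))
subgoal = plan_next_subgoal(np.zeros(2), np.full(2, 9.0), landmarks,
                            lambda a, b: float(np.linalg.norm(a - b)),
                            max_edge=6.0)
```

The edge-pruning threshold mirrors the abstract's motivation for sparse, multi-step transitions: Q-derived reachability estimates are trusted only over short horizons, and temporally extended reasoning is delegated to graph search over the landmark nodes.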