Paper Title
Flexible and Efficient Long-Range Planning Through Curious Exploration
Paper Authors
Paper Abstract
Identifying algorithms that flexibly and efficiently discover temporally-extended multi-phase plans is an essential step for the advancement of robotics and model-based reinforcement learning. The core problem of long-range planning is finding an efficient way to search through the tree of possible action sequences. Existing non-learned planning solutions from the Task and Motion Planning (TAMP) literature rely on the existence of logical descriptions for the effects and preconditions for actions. This constraint allows TAMP methods to efficiently reduce the tree search problem but limits their ability to generalize to unseen and complex physical environments. In contrast, deep reinforcement learning (DRL) methods use flexible neural-network-based function approximators to discover policies that generalize naturally to unseen circumstances. However, DRL methods struggle to handle the very sparse reward landscapes inherent to long-range multi-step planning situations. Here, we propose the Curious Sample Planner (CSP), which fuses elements of TAMP and DRL by combining a curiosity-guided sampling strategy with imitation learning to accelerate planning. We show that CSP can efficiently discover interesting and complex temporally-extended plans for solving a wide range of physically realistic 3D tasks. In contrast, standard planning and learning methods often fail to solve these tasks at all or do so only with a huge and highly variable number of training samples. We explore the use of a variety of curiosity metrics with CSP and analyze the types of solutions that CSP discovers. Finally, we show that CSP supports task transfer so that the exploration policies learned during experience with one task can help improve efficiency on related tasks.
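The abstract's core idea — searching the tree of possible action sequences by sampling nodes according to a curiosity signal rather than uniformly — can be illustrated with a toy sketch. This is not the paper's actual CSP algorithm (which uses neural-network curiosity metrics, a physics simulator, and imitation learning); it is a hypothetical minimal analogue in which visit counts stand in for a learned novelty score, and the state space is a bounded set of integers:

```python
import random
from collections import Counter

def curiosity_guided_search(start, actions, is_goal, max_expansions=5000, seed=0):
    """Toy sketch of curiosity-guided tree search: repeatedly pick an
    already-visited state, biased toward rarely-visited ("novel") ones,
    expand it with a random action, and return the action sequence once
    a goal state is produced. Returns None if no plan is found."""
    rng = random.Random(seed)
    visits = Counter({start: 1})   # visit counts stand in for a learned novelty signal
    nodes = [(start, [])]          # (state, action sequence leading to that state)
    for _ in range(max_expansions):
        # Curiosity weight: states seen less often are more "interesting",
        # so search effort concentrates on the frontier of novel states.
        weights = [1.0 / visits[s] for s, _ in nodes]
        state, plan = rng.choices(nodes, weights=weights, k=1)[0]
        name, fn = rng.choice(actions)
        nxt = fn(state)
        if nxt > 100:              # bound the toy state space to keep it tractable
            continue
        visits[nxt] += 1
        new_plan = plan + [name]
        if is_goal(nxt):
            return new_plan
        nodes.append((nxt, new_plan))
    return None

# Hypothetical usage: reach 12 from 1 using increment and doubling actions.
actions = [("inc", lambda x: x + 1), ("dbl", lambda x: x * 2)]
plan = curiosity_guided_search(1, actions, lambda s: s == 12)
```

A uniform random tree search would waste most expansions re-visiting well-explored states; weighting by novelty is the (greatly simplified) mechanism by which CSP focuses sampling on unexplored regions of the plan tree.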