Paper Title
Divide-and-Conquer Monte Carlo Tree Search For Goal-Directed Planning
Paper Authors
Paper Abstract
Standard planners for sequential decision making (including Monte Carlo planning, tree search, dynamic programming, etc.) are constrained by an implicit sequential planning assumption: the order in which a plan is constructed is the same as the order in which it is executed. We consider alternatives to this assumption for the class of goal-directed Reinforcement Learning (RL) problems. Instead of an environment transition model, we assume an imperfect, goal-directed policy. This low-level policy can be improved by a plan, consisting of an appropriate sequence of sub-goals that guide it from the start to the goal state. We propose a planning algorithm, Divide-and-Conquer Monte Carlo Tree Search (DC-MCTS), for approximating the optimal plan by means of proposing intermediate sub-goals which hierarchically partition the initial task into simpler ones that are then solved independently and recursively. The algorithm critically makes use of a learned sub-goal proposal for finding appropriate partition trees of new tasks based on prior experience. Different strategies for learning sub-goal proposals give rise to different planning strategies that strictly generalize sequential planning. We show that this algorithmic flexibility over planning order leads to improved results in navigation tasks in grid-worlds as well as in challenging continuous control environments.
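To make the divide-and-conquer idea in the abstract concrete, below is a minimal, illustrative Python sketch. Everything in it is a hypothetical stand-in, not the paper's implementation: the 1-D integer state space (with start < goal), `can_reach` (standing in for evaluating the imperfect low-level policy) and `propose_subgoal` (standing in for the learned sub-goal proposal) are invented for illustration, and the full DC-MCTS additionally searches over many candidate partitions with Monte Carlo value backups, which this single greedy recursion omits.

```python
# Illustrative sketch of divide-and-conquer sub-goal planning.
# Hypothetical stand-ins: a 1-D integer state space (start < goal),
# `can_reach` for the imperfect low-level policy, and
# `propose_subgoal` for the learned sub-goal proposal distribution.
import random


def can_reach(start: int, goal: int, horizon: int = 4) -> bool:
    # Stand-in for the low-level, goal-directed policy: assume it
    # only succeeds when start and goal are close together.
    return abs(goal - start) <= horizon


def propose_subgoal(start: int, goal: int) -> int:
    # Stand-in for the learned proposal over intermediate sub-goals;
    # here we simply sample near the midpoint instead of using a model.
    mid = (start + goal) // 2
    return mid + random.choice([-1, 0, 1])


def plan(start: int, goal: int, depth: int = 0, max_depth: int = 10):
    """Recursively split the task (start -> goal) around a proposed
    sub-goal and solve the two halves independently."""
    if can_reach(start, goal):
        return [goal]          # base case: one low-level policy call suffices
    if depth >= max_depth:
        return None            # give up on this branch
    sub = propose_subgoal(start, goal)
    left = plan(start, sub, depth + 1, max_depth)
    right = plan(sub, goal, depth + 1, max_depth)
    if left is None or right is None:
        return None
    return left + right        # ordered sequence of sub-goals for execution


if __name__ == "__main__":
    random.seed(0)
    print(plan(0, 37))  # a chain of sub-goals from 0 to 37
</pre>
```

Note how the two halves of each split are planned independently: nothing forces the left half to be finished before the right one is refined, so the plan need not be constructed in execution order. This is exactly the departure from the sequential planning assumption that the abstract highlights.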