Paper Title
Graph Value Iteration
Paper Authors
Paper Abstract
In recent years, deep Reinforcement Learning (RL) has been successful in various combinatorial search domains, such as two-player games and scientific discovery. However, directly applying deep RL in planning domains is still challenging. One major difficulty is that, without a human-crafted heuristic function, reward signals remain zero until the learning framework discovers a solution plan. The search space grows \emph{exponentially} with the minimum plan length, which is a serious limitation for planning instances whose minimum plan lengths run to hundreds or thousands of steps. Previous learning frameworks that augment graph search with deep neural networks and extra generated subgoals have achieved success in various challenging planning domains. However, generating useful subgoals requires extensive domain knowledge. We propose a domain-independent method that augments graph search with graph value iteration to solve hard planning instances that are out of reach for domain-specialized solvers. In particular, instead of deriving learning signals only from discovered plans, our approach also learns from failed search attempts in which no goal state is reached. The graph value iteration component exploits the graph structure of the local search space and provides more informative learning signals. We also show how a curriculum strategy smooths the learning process, and we give a full analysis of how graph value iteration scales and enables learning.
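To make the central idea concrete, here is a minimal sketch of value iteration over a locally explored search graph. The abstract does not specify the algorithm's details, so everything below is an assumption for illustration: unit edge costs, cost-to-go values, a learned value function (`frontier_value`) that bootstraps unexpanded frontier states, and all function and variable names are hypothetical, not the paper's actual implementation.

```python
def graph_value_iteration(nodes, successors, is_goal, frontier_value,
                          num_iters=100, tol=1e-6):
    """Bellman backups of cost-to-go V over a locally explored search graph.

    nodes:          hashable states encountered during a search attempt
    successors:     dict mapping a state to its expanded successor states
    is_goal:        predicate; goal states have cost-to-go 0
    frontier_value: callable giving a bootstrapped estimate for states
                    whose successors were not expanded (e.g., a value net)
    """
    # Initialize: goal states cost 0; every other state starts from the
    # bootstrapped estimate (for frontier states this is the final value).
    V = {s: 0.0 if is_goal(s) else frontier_value(s) for s in nodes}
    for _ in range(num_iters):
        delta = 0.0
        for s in nodes:
            # Goal states and unexpanded frontier states are fixed points.
            if is_goal(s) or not successors.get(s):
                continue
            # Bellman backup with unit edge costs:
            #   V(s) = 1 + min_{t in succ(s)} V(t)
            new_v = 1.0 + min(V.get(t, frontier_value(t))
                              for t in successors[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break  # values converged on this local graph
    return V


# Toy usage on a 4-state chain where state 3 is the goal and states 0..2
# were expanded; the "value network" is a trivial constant here.
nodes = [0, 1, 2, 3]
successors = {0: [1], 1: [2], 2: [3], 3: []}
V = graph_value_iteration(nodes, successors,
                          is_goal=lambda s: s == 3,
                          frontier_value=lambda s: 10.0)
print(V)  # {0: 3.0, 1: 2.0, 2: 1.0, 3: 0.0}
```

The point of such a backup, as the abstract suggests, is that even when no goal state is reached inside the local graph, interior states still receive sharpened targets propagated from the bootstrapped frontier, which is a more informative training signal than a uniformly zero reward.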