Paper Title
Deliberative Acting, Online Planning and Learning with Hierarchical Operational Models
Paper Authors
Paper Abstract
In AI research, synthesizing a plan of action has typically used descriptive models of the actions that abstractly specify what might happen as a result of an action, and are tailored for efficiently computing state transitions. However, executing the planned actions has needed operational models, in which rich computational control structures and closed-loop online decision-making are used to specify how to perform an action in a nondeterministic execution context, react to events and adapt to an unfolding situation. Deliberative actors, which integrate acting and planning, have typically needed to use both of these models together -- which causes problems when attempting to develop the different models, verify their consistency, and smoothly interleave acting and planning. As an alternative, we define and implement an integrated acting and planning system in which both planning and acting use the same operational models. These rely on hierarchical task-oriented refinement methods offering rich control structures. The acting component, called Reactive Acting Engine (RAE), is inspired by the well-known PRS system. At each decision step, RAE can get advice from a planner for a near-optimal choice with respect to a utility function. The anytime planner uses a UCT-like Monte Carlo Tree Search procedure, called UPOM, whose rollouts are simulations of the actor's operational models. We also present learning strategies for use with RAE and UPOM that acquire, from online acting experiences and/or simulated planning results, a mapping from decision contexts to method instances as well as a heuristic function to guide UPOM. We demonstrate the asymptotic convergence of UPOM towards optimal methods in static domains, and show experimentally that UPOM and the learning strategies significantly improve the acting efficiency and robustness.
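To make the planner's role concrete, below is a minimal, hypothetical sketch (not the authors' code) of a UCT-like choice among candidate refinement-method instances for a task, where each rollout is assumed to simulate the operational model of a candidate and return a utility estimate. The names `uct_choose` and `simulate` are illustrative placeholders, not part of RAE or UPOM.

```python
# Hypothetical illustration of a UCT-style method choice: each rollout
# simulates an operational model and returns an estimated utility.
import math
import random

def uct_choose(task, methods, simulate, n_rollouts=100, c=1.4):
    """Pick a refinement-method instance for `task` by UCT-style sampling.

    methods:   list of applicable method instances (assumed given)
    simulate:  simulate(method, task) -> utility estimate from one rollout
               through the operational model (assumed provided elsewhere)
    """
    counts = {m: 0 for m in methods}
    totals = {m: 0.0 for m in methods}

    for n in range(1, n_rollouts + 1):
        untried = [m for m in methods if counts[m] == 0]
        if untried:
            # Try every candidate at least once.
            m = random.choice(untried)
        else:
            # UCB1: balance average utility (exploitation) against
            # how rarely a method has been sampled (exploration).
            m = max(methods, key=lambda m: totals[m] / counts[m]
                    + c * math.sqrt(math.log(n) / counts[m]))
        u = simulate(m, task)   # one simulated rollout of the method
        counts[m] += 1
        totals[m] += u

    # Recommend the method with the best average estimated utility.
    return max(methods, key=lambda m: totals[m] / max(counts[m], 1))
```

In the system described above, such a recommendation would be returned to the acting engine at each decision step; the actor remains free to act reactively if no planning advice is available in time.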