Title

Goal Kernel Planning: Linearly-Solvable Non-Markovian Policies for Logical Tasks with Goal-Conditioned Options

Authors

Thomas J. Ringstrom, Mohammadhosein Hasanbeig, Alessandro Abate

Abstract

In the domain of hierarchical planning, compositionality, abstraction, and task transfer are crucial for designing algorithms that can efficiently solve a variety of problems with maximal representational reuse. Many real-world problems require non-Markovian policies to handle complex structured tasks with logical conditions, often leading to prohibitively large state representations; this requires efficient methods for breaking these problems down and reusing structure between tasks. To this end, we introduce a compositional framework called Linearly-Solvable Goal Kernel Dynamic Programming (LS-GKDP) to address the complexity of solving non-Markovian Boolean sub-goal tasks with ordering constraints. LS-GKDP combines the Linearly-Solvable Markov Decision Process (LMDP) formalism with the Options Framework of Reinforcement Learning. LMDPs can be efficiently solved as a principal eigenvector problem, and options are policies with termination conditions used as temporally extended actions; with LS-GKDP we expand LMDPs to control over options for logical tasks. This involves decomposing a high-dimensional problem down into a set of goal-conditioned options for each goal and constructing a goal kernel, which is an abstract transition kernel that jumps from an option's initial-states to its termination-states along with an update of the higher-level task-state. We show how an LMDP with a goal kernel enables the efficient optimization of meta-policies in a lower-dimensional subspace defined by the task grounding. Options can also be remapped to new problems within a super-exponential space of tasks without significant recomputation, and we identify cases where the solution is invariant to the task grounding, permitting zero-shot task transfer.
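
The abstract's two main computational ingredients can be made concrete. First, the LMDP formalism reduces optimal control to a principal-eigenvector computation on the desirability function z = exp(-v). Below is a minimal numpy sketch of that computation in the standard linearly-solvable MDP setting; it illustrates the formalism the paper builds on rather than the paper's own code, and the names (`P`, `q`, `solve_lmdp`) are generic placeholders.

```python
import numpy as np

def solve_lmdp(P, q, iters=5000, tol=1e-10):
    """Power iteration for the desirability function z = exp(-v) of an LMDP.

    P : (n, n) passive-dynamics matrix, rows sum to 1 (P[x, x'] = p(x' | x)).
    q : (n,) per-state cost vector.
    The linear Bellman relation z ∝ diag(exp(-q)) @ P @ z makes z the principal
    eigenvector of G = diag(exp(-q)) @ P, so power iteration suffices.
    """
    G = np.diag(np.exp(-q)) @ P
    z = np.ones(P.shape[0])
    for _ in range(iters):
        z_new = G @ z
        z_new /= np.linalg.norm(z_new)  # rescale; the eigenvalue carries the cost scale
        if np.linalg.norm(z_new - z) < tol:
            z = z_new
            break
        z = z_new
    # Optimal controlled transition probabilities: u*(x' | x) ∝ p(x' | x) z(x').
    U = P * z[None, :]
    U /= U.sum(axis=1, keepdims=True)
    return z, U
```

Second, the goal kernel described above jumps from an option's initiation state to its termination state while simultaneously updating the higher-level task state. The sketch below, again a hedged illustration rather than the paper's exact construction, assumes per-option termination kernels `K_g` (e.g. obtained from each option's solved LMDP) and a hypothetical deterministic `task_update` function standing in for the Boolean sub-goal logic with ordering constraints.

```python
import numpy as np

def build_goal_kernel(option_kernels, task_update, n_task, n_state):
    """Assemble a goal kernel on the joint (task-state, low-level state) space.

    option_kernels : dict  g -> (n_state, n_state) kernel K_g, where K_g[x, x'] is the
                     probability that option g, launched at x, terminates at x'.
    task_update    : callable (s, g) -> s', a hypothetical task transition, e.g. setting
                     goal g's Boolean bit when its ordering constraints are satisfied.
    Returns        : dict  g -> (n_task * n_state, n_task * n_state) joint jump kernel.
    """
    joint = {}
    for g, K in option_kernels.items():
        T = np.zeros((n_task * n_state, n_task * n_state))
        for s in range(n_task):
            s_next = task_update(s, g)
            # The (s -> s_next) block carries the option's low-level jump probabilities.
            T[s * n_state:(s + 1) * n_state,
              s_next * n_state:(s_next + 1) * n_state] = K
        joint[g] = T
    return joint
```

A meta-policy then only chooses among goals on this abstract kernel, which is consistent with the lower-dimensional meta-policy optimization over the task grounding described in the abstract.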
