Paper Title
Induction and Exploitation of Subgoal Automata for Reinforcement Learning
Paper Authors
Paper Abstract
In this paper we present ISA, an approach for learning and exploiting subgoals in episodic reinforcement learning (RL) tasks. ISA interleaves reinforcement learning with the induction of a subgoal automaton, an automaton whose edges are labeled by the task's subgoals expressed as propositional logic formulas over a set of high-level events. A subgoal automaton also contains two special states: one indicating that the task has been completed successfully, and one indicating that the task has finished without succeeding. A state-of-the-art inductive logic programming system is used to learn a subgoal automaton that covers the traces of high-level events observed by the RL agent. When the currently exploited automaton does not correctly recognize a trace, the automaton learner induces a new automaton that covers that trace. The interleaving process guarantees the induction of automata with the minimum number of states, and applies a symmetry breaking mechanism to shrink the search space whilst remaining complete. We evaluate ISA in several gridworld and continuous state space problems using different RL algorithms that leverage the automaton structures. We provide an in-depth empirical analysis of the automaton learning performance in terms of the traces, the symmetry breaking, and the specific restrictions imposed on the final learnable automata. For each class of RL problems, we show that the learned automata can be successfully exploited to learn policies that reach the goal, achieving an average reward comparable to the case where the automata are handcrafted and given beforehand instead of learned.
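To make the automaton structure concrete, the following is a minimal Python sketch of a subgoal automaton as described in the abstract: edges labeled by conjunctions of (possibly negated) high-level events, plus dedicated accepting and rejecting states. All names here (SubgoalAutomaton, the coffee/office events, and so on) are illustrative assumptions, not identifiers taken from the ISA implementation.

```python
from typing import Dict, List, Set, Tuple

# A propositional condition over high-level events, as a pair
# (positive, negative): the edge fires when every event in `positive`
# is observed and no event in `negative` is.
Condition = Tuple[Set[str], Set[str]]


class SubgoalAutomaton:
    """Minimal sketch of a subgoal automaton (hypothetical names).

    Edges are labeled by conjunctions of (possibly negated) high-level
    events; two special states mark task success and task failure.
    """

    def __init__(self, initial: str, accepting: str, rejecting: str):
        self.initial = initial
        self.accepting = accepting   # task completed successfully
        self.rejecting = rejecting   # task finished without succeeding
        self.edges: Dict[str, List[Tuple[Condition, str]]] = {}

    def add_edge(self, src: str, positive: Set[str],
                 negative: Set[str], dst: str) -> None:
        self.edges.setdefault(src, []).append(((positive, negative), dst))

    def step(self, state: str, events: Set[str]) -> str:
        """Take one transition given the events observed this time step."""
        for (pos, neg), dst in self.edges.get(state, []):
            if pos <= events and not (neg & events):
                return dst
        return state  # no edge condition satisfied: stay in the same state

    def recognizes(self, trace: List[Set[str]]) -> bool:
        """Check whether a trace of event sets ends in the accepting state."""
        state = self.initial
        for events in trace:
            state = self.step(state, events)
        return state == self.accepting


# Illustrative task: reach the coffee machine, then the office,
# while never stepping on a decoration (event names are assumptions).
a = SubgoalAutomaton(initial="u0", accepting="u_acc", rejecting="u_rej")
a.add_edge("u0", {"coffee"}, {"decoration"}, "u1")
a.add_edge("u0", {"decoration"}, set(), "u_rej")
a.add_edge("u1", {"office"}, {"decoration"}, "u_acc")
a.add_edge("u1", {"decoration"}, set(), "u_rej")

print(a.recognizes([{"coffee"}, {"office"}]))  # True: goal trace
print(a.recognizes([{"decoration"}]))          # False: ends in u_rej
```

In ISA itself, the agent would run such an automaton alongside the RL algorithm; whenever an observed trace is misclassified by the current automaton (for instance, a successful episode whose trace does not end in the accepting state), the inductive logic programming system is invoked to induce a new minimal automaton covering the traces collected so far.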