Paper Title
Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning
Paper Authors
Paper Abstract
Reinforcement learning (RL) methods usually treat reward functions as black boxes. As such, these methods must extensively interact with the environment in order to discover rewards and optimal policies. In most RL applications, however, users have to program the reward function and, hence, there is the opportunity to make the reward function visible -- to show the reward function's code to the RL agent so it can exploit the function's internal structure to learn optimal policies in a more sample efficient manner. In this paper, we show how to accomplish this idea in two steps. First, we propose reward machines, a type of finite state machine that supports the specification of reward functions while exposing reward function structure. We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning. Experiments on tabular and continuous domains, across different tasks and RL agents, show the benefits of exploiting reward structure with respect to sample efficiency and the quality of resultant policies. Finally, by virtue of being a form of finite state machine, reward machines have the expressive power of a regular language and as such support loops, sequences and conditionals, as well as the expression of temporally extended properties typical of linear temporal logic and non-Markovian reward specification.
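The abstract describes a reward machine as a finite state machine whose transitions are triggered by high-level events and emit rewards. As a minimal illustrative sketch only (not the authors' code), the Python below shows one possible encoding of such a machine; the state names, event labels, and the two-step coffee-delivery task are hypothetical choices made for this example.

# Illustrative sketch of a reward machine: a finite-state machine whose
# transitions are triggered by high-level events observed in the environment
# and emit rewards. State and event names here are hypothetical.
class RewardMachine:
    def __init__(self, initial_state, transitions, terminal_states):
        # transitions: dict mapping (state, event) -> (next_state, reward)
        self.initial_state = initial_state
        self.transitions = transitions
        self.terminal_states = terminal_states
        self.state = initial_state

    def reset(self):
        self.state = self.initial_state
        return self.state

    def step(self, event):
        # Advance on an observed event; return (new_state, reward, done).
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0)  # no matching edge: stay, reward 0
        )
        self.state = next_state
        return next_state, reward, next_state in self.terminal_states


# Example task: "get coffee, then deliver it to the office".
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "coffee"): ("u1", 0.0),   # picked up coffee, no reward yet
        ("u1", "office"): ("u2", 1.0),   # delivered coffee, task complete
    },
    terminal_states={"u2"},
)

rm.reset()
print(rm.step("office"))  # ('u0', 0.0, False) -- office before coffee does nothing
print(rm.step("coffee"))  # ('u1', 0.0, False)
print(rm.step("office"))  # ('u2', 1.0, True)

Making the machine visible in a form like this is what lets an agent exploit the reward structure, for instance by shaping rewards per machine state or by decomposing the task into one sub-policy per machine state, as the abstract indicates.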