Paper Title

SHIRO: Soft Hierarchical Reinforcement Learning

Paper Authors

Kandai Watanabe, Mathew Strong, Omer Eldar

Paper Abstract

Hierarchical Reinforcement Learning (HRL) algorithms have been demonstrated to perform well on high-dimensional decision making and robotic control tasks. However, because they solely optimize for rewards, the agent tends to search the same space redundantly. This problem reduces the speed of learning and achieved reward. In this work, we present an Off-Policy HRL algorithm that maximizes entropy for efficient exploration. The algorithm learns a temporally abstracted low-level policy and is able to explore broadly through the addition of entropy to the high-level. The novelty of this work is the theoretical motivation of adding entropy to the RL objective in the HRL setting. We empirically show that the entropy can be added to both levels if the Kullback-Leibler (KL) divergence between consecutive updates of the low-level policy is sufficiently small. We performed an ablative study to analyze the effects of entropy on hierarchy, in which adding entropy to high-level emerged as the most desirable configuration. Furthermore, a higher temperature in the low-level leads to Q-value overestimation and increases the stochasticity of the environment that the high-level operates on, making learning more challenging. Our method, SHIRO, surpasses state-of-the-art performance on a range of simulated robotic control benchmark tasks and requires minimal tuning.
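As a rough illustration of the idea described in the abstract (not the authors' implementation), the sketch below shows the SAC-style soft value estimate that results from adding an entropy bonus to a high-level policy over subgoals: V(s) = E_{g~pi_hi}[Q(s, g)] + alpha * H(pi_hi(.|s)). The names `soft_value`, `pi_hi`, `alpha`, and the toy Q-values are illustrative assumptions.

```python
# Minimal sketch of an entropy-regularized (soft) value for a high-level policy
# over subgoals, in the spirit of the abstract. Not the paper's code.
import numpy as np

def soft_value(q_values: np.ndarray, probs: np.ndarray, alpha: float) -> float:
    """Soft state value under a discrete high-level policy.

    q_values: Q(s, g) for each candidate subgoal g.
    probs:    pi_hi(g | s), the high-level policy's probabilities over subgoals.
    alpha:    temperature weighting the entropy bonus.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-8))
    return float(np.sum(probs * q_values) + alpha * entropy)

# Toy example: three candidate subgoals with similar Q-values.
q = np.array([1.0, 0.9, 0.8])
greedy = np.array([1.0, 0.0, 0.0])   # deterministic high-level policy
spread = np.array([0.5, 0.3, 0.2])   # more exploratory high-level policy

for name, pi in [("greedy", greedy), ("spread", spread)]:
    print(name, soft_value(q, pi, alpha=0.5))
```

With alpha > 0, the more exploratory policy scores higher even though its expected Q-value is slightly lower, which is the mechanism the abstract points to: entropy at the high level rewards broader subgoal search rather than repeatedly committing to the same region of the space.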
