Title

CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning

Authors

Long Yang, Jiaming Ji, Juntao Dai, Yu Zhang, Pengfei Li, Gang Pan

Abstract

Safe reinforcement learning (RL) is still very challenging since it requires the agent to consider both return maximization and safe exploration. In this paper, we propose CUP, a Conservative Update Policy algorithm with a theoretical safety guarantee. We derive CUP from newly proposed performance bounds and surrogate functions. Although using bounds as surrogate functions to design safe RL algorithms has appeared in some existing works, we develop this idea in at least three aspects: (i) We provide a rigorous theoretical analysis that extends the surrogate functions to the generalized advantage estimator (GAE). GAE significantly reduces variance empirically while maintaining a tolerable level of bias, which is a key step in designing CUP; (ii) The proposed bounds are tighter than those in existing works, i.e., using the proposed bounds as surrogate functions gives better local approximations to the objective and the safety constraints; (iii) CUP admits a non-convex implementation via first-order optimizers, which does not depend on any convex approximation. Finally, extensive experiments show the effectiveness of CUP, with the agent satisfying the safety constraints. We have open-sourced CUP at https://github.com/RL-boxes/Safe-RL.
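
For reference, point (i) of the abstract builds on the generalized advantage estimator (GAE). The sketch below is a minimal, self-contained illustration of GAE for a single trajectory segment without terminations; the function name and default hyperparameters (gamma=0.99, lam=0.95) are assumptions for illustration and are not taken from the paper or its released code.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation:
    A_t = sum_{l>=0} (gamma * lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    next_value = last_value                                   # bootstrap value V(s_T)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        running = delta + gamma * lam * running               # discounted sum of TD errors
        advantages[t] = running
        next_value = values[t]
    return advantages
```

In a constrained setting such as the one considered here, this kind of estimator would typically be applied to both the reward signal and the cost signal to obtain the reward and cost advantages entering the surrogate objective and the safety constraint; how the paper handles this precisely is given in the full text rather than the abstract.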
