Paper Title

Constrained Update Projection Approach to Safe Policy Optimization

Paper Authors

Long Yang, Jiaming Ji, Juntao Dai, Linrui Zhang, Binbin Zhou, Pengfei Li, Yaodong Yang, Gang Pan

Paper Abstract

Safe reinforcement learning (RL) studies problems where an intelligent agent must not only maximize reward but also avoid exploring unsafe areas. In this study, we propose CUP, a novel policy optimization method based on the Constrained Update Projection framework that enjoys rigorous safety guarantees. Central to our CUP development are the newly proposed surrogate functions along with the performance bound. Compared to previous safe RL methods, CUP enjoys the following benefits: 1) it generalizes the surrogate functions to the generalized advantage estimator (GAE), leading to strong empirical performance; 2) it unifies performance bounds, providing a better understanding of and interpretability for some existing algorithms; 3) it provides a non-convex implementation via only first-order optimizers, which does not require any strong approximation of the convexity of the objectives. To validate our CUP method, we compare CUP against a comprehensive list of safe RL baselines on a wide range of tasks. Experiments show the effectiveness of CUP in terms of both reward and safety constraint satisfaction. We have open-sourced CUP at https://github.com/zmsn-2077/CUP-safe-rl.
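For reference, the generalized advantage estimator (GAE) that the abstract builds on is, in the standard notation of Schulman et al. (2016) with TD residual \delta_t, discount \gamma, and trace parameter \lambda (this notation is standard in the GAE literature, not taken from the paper itself):

\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t).

The constrained update projection named in the title can be read as a two-step update. The following is a minimal schematic under the assumption that D is a divergence measure (e.g., a KL divergence), J^{C}(\pi) is the expected cumulative cost, and b is the safety limit; the exact surrogate objectives and performance bounds are the paper's contribution and are not reproduced here:

\pi_{k+\frac{1}{2}} = \arg\max_{\pi} \; \mathbb{E}_{s,a \sim \pi_k}\!\big[ \hat{A}^{\mathrm{GAE}}_{\pi_k}(s,a) \big] \quad \text{(performance improvement)}

\pi_{k+1} = \arg\min_{\pi} \; D\big(\pi,\, \pi_{k+\frac{1}{2}}\big) \;\; \text{s.t.} \;\; J^{C}(\pi) \le b \quad \text{(projection onto the constraint set)}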
