Paper title
Reachability Constrained Reinforcement Learning
Paper authors
Abstract
Constrained reinforcement learning (CRL) has recently gained significant interest, since satisfying safety constraints is critical for real-world problems. However, existing CRL methods that constrain discounted cumulative costs generally lack a rigorous definition and guarantee of safety. In contrast, safe control research defines safety as persistently satisfying certain state constraints. Such persistent safety is possible only on a subset of the state space, called the feasible set, and an optimal largest feasible set exists for a given environment. Recent studies incorporate feasible sets into CRL with energy-based methods such as the control barrier function (CBF) and the safety index (SI), but they rely on prior conservative estimations of feasible sets, which harms the performance of the learned policy. To address this problem, this paper proposes the reachability CRL (RCRL) method, which uses reachability analysis to establish a novel self-consistency condition and characterize the feasible sets. The feasible sets are represented by the safety value function, which is used as the constraint in CRL. We use multi-time-scale stochastic approximation theory to prove that the proposed algorithm converges to a local optimum, where the largest feasible set can be guaranteed. Empirical results on different benchmarks validate the learned feasible set, policy performance, and constraint satisfaction of RCRL compared with CRL and safe control baselines.
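To make the notions of feasible set and safety value function concrete, here is a minimal sketch on a hypothetical toy problem. Everything in it is an assumption for illustration and not taken from the paper: the grid world (a cart approaching a wall), the constraint function `h`, the dynamics `step`, and the backup `V(s) = max(h(s), min_a V(s'))`, which is the textbook discrete-time reachability recursion rather than the exact self-consistency condition the abstract refers to. The convention assumed here is that `h(s) <= 0` means the state constraint holds, and the feasible set is `{s : V(s) <= 0}`.

```python
import numpy as np

# Hypothetical toy problem: positions 0..10 and speeds 0..3.
N_X, N_V = 11, 4
ACTIONS = (-1, 0, 1)  # brake, coast, accelerate


def h(x, v):
    # Assumed constraint function: h <= 0 means the constraint x <= 9 holds.
    return float(x - 9)


def step(x, v, a):
    # Assumed deterministic dynamics: change speed first, then move.
    # Position 10 is the wall; the cart cannot go past it.
    v_next = min(max(v + a, 0), N_V - 1)
    x_next = min(x + v_next, N_X - 1)
    return x_next, v_next


def safety_value(max_iters=100):
    # Fixed-point iteration of V(s) = max(h(s), min_a V(s')), starting from V = h.
    # V(s) is the smallest achievable worst-case value of h along a trajectory.
    H = np.array([[h(x, v) for v in range(N_V)] for x in range(N_X)])
    V = H.copy()
    for _ in range(max_iters):
        V_new = np.empty_like(V)
        for x in range(N_X):
            for v in range(N_V):
                best_next = min(V[step(x, v, a)] for a in ACTIONS)
                V_new[x, v] = max(H[x, v], best_next)
        if np.array_equal(V_new, V):
            break
        V = V_new
    return V


V = safety_value()
feasible = V <= 0  # states from which the constraint can be satisfied forever
```

In this toy setting the state `(x=9, v=2)` satisfies the constraint now (`h <= 0`) but is not feasible, because even full braking carries the cart into the wall. That gap between "currently safe" and "persistently safe" is what the feasible set captures.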