Paper Title
Assured RL: Reinforcement Learning with Almost Sure Constraints
Paper Authors
Paper Abstract
We consider the problem of finding optimal policies for a Markov Decision Process with almost sure constraints on state transition and action triplets. We define value and action-value functions that satisfy a barrier-based decomposition, which allows for the identification of feasible policies independently of the reward process. We prove that, given a policy π, certifying whether certain state-action pairs lead to feasible trajectories under π is equivalent to solving an auxiliary problem aimed at finding the probability of performing an unfeasible transition. Using this interpretation, we develop a Barrier-learning algorithm, based on Q-Learning, that identifies such unsafe state-action pairs. Our analysis motivates the need to enhance the Reinforcement Learning (RL) framework with an additional signal, besides rewards, called here the damage function, which provides feasibility information and enables the solution of constrained RL problems in a model-free manner. Moreover, our Barrier-learning algorithm wraps around existing RL algorithms, such as Q-Learning and SARSA, giving them the ability to solve almost-surely constrained problems.
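The abstract describes the Barrier-learning update only at a high level. Below is a minimal tabular sketch of the idea under stated assumptions: a finite MDP where the environment emits a binary damage signal d ∈ {0, 1} alongside the reward, with d = 1 marking an unfeasible transition. The names (`barrier_update`, `safe_actions`, the threshold `eps`) and the exact update rule are illustrative, not the paper's API.

```python
import numpy as np

# Sketch of Barrier-learning wrapped around tabular Q-Learning.
# B[s, a] estimates the probability of ever performing an unfeasible
# transition when taking action a in state s and acting as safely as
# possible afterwards. Under almost sure constraints, a state-action
# pair is feasible only if this probability is zero.

n_states, n_actions = 10, 4            # assumed problem sizes
B = np.zeros((n_states, n_actions))    # barrier (violation-probability) estimates
Q = np.zeros((n_states, n_actions))    # ordinary action-values
alpha, gamma, eps = 0.1, 0.99, 1e-3    # step size, discount, feasibility threshold

def barrier_update(s, a, d, s_next):
    """Q-learning-style update of the barrier estimate.

    If the transition caused damage (d == 1) the target is 1; otherwise
    it is the smallest violation probability reachable from s_next,
    i.e. the barrier value of the safest next action.
    """
    target = 1.0 if d else B[s_next].min()
    B[s, a] += alpha * (target - B[s, a])

def safe_actions(s):
    """Actions whose estimated violation probability is (numerically) zero."""
    safe = np.flatnonzero(B[s] < eps)
    # Fall back to all actions if nothing is certified safe yet.
    return safe if safe.size > 0 else np.arange(n_actions)

def q_update(s, a, r, s_next):
    """Standard Q-Learning, except the bootstrap maximizes over the
    certified-safe actions only, keeping the greedy policy feasible."""
    best_next = Q[s_next, safe_actions(s_next)].max()
    Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
```

Note that `barrier_update` depends only on the damage signal, never on rewards, which mirrors the decomposition claimed in the abstract: the same B table could equally wrap SARSA by restricting its behavior policy to `safe_actions(s)`.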