Paper Title
Constrained Markov Decision Processes via Backward Value Functions
Paper Authors
Paper Abstract
Although Reinforcement Learning (RL) algorithms have found tremendous success in simulated domains, they often cannot be directly applied to physical systems, especially in cases where there are hard constraints to satisfy (e.g., on safety or resources). In standard RL, the agent is incentivized to explore any behavior as long as it maximizes rewards, but in the real world, undesired behavior can damage either the system or the agent in a way that breaks the learning process itself. In this work, we model the problem of learning with constraints as a Constrained Markov Decision Process and provide a new on-policy formulation for solving it. A key contribution of our approach is to translate cumulative cost constraints into state-based constraints. Through this, we define a safe policy improvement method which maximizes returns while ensuring that the constraints are satisfied at every step. We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training. We also highlight the computational advantages of this approach. The effectiveness of our approach is demonstrated on safe navigation tasks and in safety-constrained versions of MuJoCo environments, with deep neural networks.
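As a rough illustration of the reformulation sketched in the abstract (the notation below, including the cost function $c$, the cost value functions $V_C^{\pi}$ and $\overleftarrow{V}_C^{\pi}$, and the threshold $d_0$, is assumed here for exposition and may differ from the paper's exact definitions), the standard CMDP problem constrains an expected cumulative cost:
\[
\max_{\pi}\;\mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,a_t)\Big]
\quad\text{s.t.}\quad
\mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\infty} c(s_t,a_t)\Big]\;\le\; d_0 ,
\]
and the state-based constraint referred to in the abstract can be read as requiring, at every state $s$ visited by the agent,
\[
\overleftarrow{V}_C^{\pi}(s) \;+\; V_C^{\pi}(s) \;\le\; d_0 ,
\]
where $V_C^{\pi}(s)$ is the usual forward expected cost-to-go from $s$ and $\overleftarrow{V}_C^{\pi}(s)$ is a backward value function estimating the expected cost already accumulated on trajectories arriving at $s$. Enforcing this inequality at each step is what allows a per-step safe policy improvement rule rather than a single trajectory-level constraint.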