Paper Title
Safe reinforcement learning for multi-energy management systems with known constraint functions
Paper Authors
Paper Abstract
Reinforcement learning (RL) is a promising optimal control technique for multi-energy management systems. It does not require a model a priori, reducing the upfront and ongoing project-specific engineering effort, and it is capable of learning better representations of the underlying system dynamics. However, vanilla RL does not provide constraint satisfaction guarantees, resulting in various potentially unsafe interactions within its safety-critical environment. In this paper, we present two novel safe RL methods, namely SafeFallback and GiveSafe, in which the safety constraint formulation is decoupled from the RL formulation. These provide hard-constraint, rather than soft- and chance-constraint, satisfaction guarantees both during the training of a (near-)optimal policy (which involves exploratory and exploitative, i.e. greedy, steps) and during the deployment of any policy (e.g. random agents or offline-trained RL agents). This is achieved without the need to solve a mathematical program, resulting in lower computational power requirements and a more flexible constraint function formulation (no derivative information is required). In a simulated multi-energy systems case study, we have shown that both methods start with a significantly higher utility (i.e. useful policy) than a vanilla RL benchmark and an OptLayer benchmark (94.6% and 82.8% compared to 35.5% and 77.8%), and that the proposed SafeFallback method can even outperform the vanilla RL benchmark (102.9% vs. 100%). We conclude that both methods are viable safety constraint handling techniques applicable beyond RL, as demonstrated with random policies, while still providing hard-constraint guarantees.
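Below is a minimal, hypothetical Python sketch of the decoupled safety-check idea described in the abstract: a wrapper evaluates known constraint functions on whatever action a policy proposes and substitutes a predefined safe fallback action when a constraint would be violated. The wrapper class, the constraint-function signature, and the battery power limit in the usage example are illustrative assumptions for this sketch, not the authors' implementation of SafeFallback or GiveSafe.

import random

# Illustrative sketch (not the paper's implementation): a safety layer that is
# decoupled from the RL agent. Any policy (trained, exploratory, or random)
# proposes an action; the layer checks the known constraint functions and falls
# back to a predefined safe action if any constraint would be violated, so
# hard-constraint satisfaction holds during both training and deployment.

class SafeFallbackWrapper:
    def __init__(self, constraint_fns, fallback_action):
        # constraint_fns: callables g(state, action) -> float, where g <= 0
        # is taken to mean "safe" (an assumed convention for this sketch).
        self.constraint_fns = constraint_fns
        self.fallback_action = fallback_action

    def filter(self, state, proposed_action):
        # Keep the proposed action only if every known constraint is satisfied.
        if all(g(state, proposed_action) <= 0.0 for g in self.constraint_fns):
            return proposed_action
        return self.fallback_action

# Usage with a toy battery-power constraint (hypothetical numbers).
max_charge_kw = 5.0
wrapper = SafeFallbackWrapper(
    constraint_fns=[lambda s, a: abs(a) - max_charge_kw],  # require |power| <= 5 kW
    fallback_action=0.0,                                    # "do nothing" assumed safe
)

state = {"soc": 0.5}                     # state is opaque to the wrapper
proposed = random.uniform(-10.0, 10.0)   # e.g. an action from a random policy
safe_action = wrapper.filter(state, proposed)
print(proposed, "->", safe_action)

Because the check only calls the constraint functions as black boxes, it needs no derivative information and no mathematical program solver, which is the property the abstract highlights; it also applies to any action source, which is why the method can be demonstrated with random policies.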