Paper Title
Efficient Risk-Averse Reinforcement Learning
Paper Authors
Paper Abstract
In risk-averse reinforcement learning (RL), the goal is to optimize some risk measure of the returns. A risk measure often focuses on the worst returns out of the agent's experience. As a result, standard methods for risk-averse RL often ignore high-return strategies. We prove that under certain conditions this inevitably leads to a local-optimum barrier, and propose a soft risk mechanism to bypass it. We also devise a novel Cross Entropy module for risk sampling, which (1) preserves risk aversion despite the soft risk; (2) independently improves sample efficiency. By separating the risk aversion of the sampler and the optimizer, we can sample episodes with poor conditions, yet optimize with respect to successful strategies. We combine these two concepts in CeSoR - Cross-entropy Soft-Risk optimization algorithm - which can be applied on top of any risk-averse policy gradient (PG) method. We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks, including in scenarios where standard risk-averse PG completely fails.
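To make the abstract's ingredients concrete, below is a minimal, hypothetical sketch assuming a CVaR-style risk measure over episode returns and a single scalar environment "condition" that the sampler can control. The names `soft_alpha`, `cvar_weights`, `CEConditionSampler`, and `run_episode` are invented for illustration; this is not the authors' CeSoR implementation, only one plausible reading of the soft-risk schedule and the cross-entropy sampler described above.

```python
# Hypothetical sketch (not the authors' code): a soft-risk schedule that anneals
# the CVaR level, a CVaR-style re-weighting of episode returns for the optimizer,
# and a cross-entropy sampler over an environment condition for the sampler.
import numpy as np

rng = np.random.default_rng(0)

def soft_alpha(step, total_steps, target_alpha=0.05):
    """Anneal the risk level from 1.0 (risk-neutral) down to target_alpha,
    so early training does not ignore high-return strategies."""
    frac = min(step / max(total_steps, 1), 1.0)
    return 1.0 - frac * (1.0 - target_alpha)

def cvar_weights(returns, alpha):
    """Put all policy-gradient weight on the worst alpha-fraction of returns
    (a simple CVaR-style re-weighting of the sampled episodes)."""
    returns = np.asarray(returns, dtype=float)
    k = max(int(np.ceil(alpha * len(returns))), 1)
    weights = np.zeros_like(returns)
    weights[np.argsort(returns)[:k]] = 1.0 / k
    return weights

class CEConditionSampler:
    """Cross-entropy sampler over a scalar environment condition: it shifts a
    Gaussian proposal toward the conditions that produced the worst returns,
    so the optimizer keeps seeing tail episodes even while the risk is soft."""
    def __init__(self, mean=0.0, std=1.0, elite_frac=0.2, lr=0.5):
        self.mean, self.std = mean, std
        self.elite_frac, self.lr = elite_frac, lr

    def sample(self, n):
        return rng.normal(self.mean, self.std, size=n)

    def update(self, conditions, returns):
        k = max(int(self.elite_frac * len(returns)), 1)
        elite = np.asarray(conditions)[np.argsort(returns)[:k]]
        self.mean += self.lr * (elite.mean() - self.mean)
        self.std += self.lr * (max(elite.std(), 1e-3) - self.std)

def run_episode(condition):
    """Stand-in environment: harsher conditions yield lower, noisier returns."""
    return -condition + rng.normal(0.0, 0.5)

sampler = CEConditionSampler()
total_iters, batch = 50, 32
for it in range(total_iters):
    conditions = sampler.sample(batch)                     # sample (possibly hard) conditions
    returns = np.array([run_episode(c) for c in conditions])
    alpha = soft_alpha(it, total_iters)                    # soft risk for the optimizer
    weights = cvar_weights(returns, alpha)                 # would weight the PG update
    sampler.update(conditions, returns)                    # keep the sampler risk-averse
print("final proposal mean/std:", round(sampler.mean, 2), round(sampler.std, 2))
```

The split mirrors the abstract's separation of concerns: the sampler's update always chases the low-return tail, while the optimizer's effective risk level starts neutral and only gradually tightens toward the target.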