Paper Title

Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning

Paper Authors

Shangtong Zhang, Bo Liu, Shimon Whiteson

Paper Abstract

We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a discounted infinite horizon MDP optimizing the variance of a per-step reward random variable. MVPI enjoys great flexibility in that any policy evaluation method and risk-neutral control method can be dropped in for risk-averse control off the shelf, in both on- and off-policy settings. This flexibility reduces the gap between risk-neutral control and risk-averse control and is achieved by working on a novel augmented MDP directly. We propose risk-averse TD3 as an example instantiating MVPI, which outperforms vanilla TD3 and many previous risk-averse control methods in challenging Mujoco robot simulation tasks under a risk-aware performance metric. This risk-averse TD3 is the first to introduce deterministic policies and off-policy learning into risk-averse reinforcement learning, both of which are key to the performance boost we show in Mujoco domains.
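For concreteness, here is a hedged sketch of the per-step reward mean-variance objective the abstract refers to; the notation (trade-off weight \lambda, stationary distribution d_\pi) is ours and may differ from the paper's:

$$
\max_{\pi} \; J_{\lambda}(\pi) \;=\; \mathbb{E}_{(s,a)\sim d_{\pi}}\big[ r(s,a) \big] \;-\; \lambda \, \mathrm{Var}_{(s,a)\sim d_{\pi}}\big[ r(s,a) \big],
$$

where $r(s,a)$ is the per-step reward, $d_{\pi}$ is the stationary state-action distribution induced by $\pi$, and $\lambda \ge 0$ controls the degree of risk aversion. Per the abstract, MVPI handles this risk-averse objective by working on a novel augmented MDP, so any off-the-shelf policy evaluation and risk-neutral control method can be plugged in, in both on- and off-policy settings.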
