Paper Title
Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy
Paper Authors
Paper Abstract
While deep reinforcement learning has achieved tremendous successes in various applications, most existing works only focus on maximizing the expected value of total return and thus ignore its inherent stochasticity. Such stochasticity is also known as aleatoric uncertainty and is closely related to the notion of risk. In this work, we make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with a variance risk criterion. In particular, we focus on a variance-constrained policy optimization problem where the goal is to find a policy that maximizes the expected value of the long-run average reward, subject to a constraint that the long-run variance of the average reward is upper bounded by a threshold. Utilizing Lagrangian and Fenchel dualities, we transform the original problem into an unconstrained saddle-point policy optimization problem, and propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable. When both the value and policy functions are represented by multi-layer overparameterized neural networks, we prove that our actor-critic algorithm generates a sequence of policies that finds a globally optimal policy at a sublinear rate. Further, we provide numerical studies of the proposed method using two real datasets to back up the theoretical results.
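For concreteness, here is a minimal sketch of how such a reformulation typically looks under the standard mean-variance Lagrangian-plus-Fenchel-duality construction; the notation ($J(\pi)$ for the long-run average reward, $\Lambda(\pi)$ for its long-run variance, $\xi$ for the threshold, $\lambda$ for the Lagrange multiplier, $y$ for the Fenchel dual variable) is illustrative and may differ from the paper's exact formulation. The constrained problem

\[
\max_{\pi} \; J(\pi) \quad \text{s.t.} \quad \Lambda(\pi) \le \xi,
\qquad \Lambda(\pi) = \mathbb{E}_{d_\pi}\!\left[r^2\right] - J(\pi)^2,
\]

becomes, after introducing the multiplier $\lambda \ge 0$ and replacing the quadratic term $J(\pi)^2$ by its Fenchel dual representation $\max_{y}\,\bigl(2\,y\,J(\pi) - y^2\bigr)$, an unconstrained saddle-point problem of the form

\[
\max_{\pi,\, y} \; \min_{\lambda \ge 0} \;
L(\pi, y, \lambda) \;=\; J(\pi) \;-\; \lambda \Bigl( \mathbb{E}_{d_\pi}\!\left[r^2\right] \;-\; 2\,y\,J(\pi) \;+\; y^2 \;-\; \xi \Bigr).
\]

An actor-critic scheme can then alternate a policy-gradient ascent step in $\pi$ (with the critic estimating the required value quantities), an ascent step in the Fenchel dual variable $y$, and a projected descent step in the Lagrange multiplier $\lambda$, which matches the three interleaved updates described in the abstract.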