Paper Title

On the Convergence of Discounted Policy Gradient Methods

Paper Authors

Nota, Chris

Paper Abstract

Many popular policy gradient methods for reinforcement learning follow a biased approximation of the policy gradient known as the discounted approximation. While it has been shown that the discounted approximation of the policy gradient is not the gradient of any objective function, little else is known about its convergence behavior or properties. In this paper, we show that if the discounted approximation is followed such that the discount factor is increased slowly at a rate related to a decreasing learning rate, the resulting method recovers the standard guarantees of gradient ascent on the undiscounted objective.
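The abstract describes the key idea at a high level: follow the biased discounted policy gradient while letting the discount factor γ_t increase toward 1 at a rate tied to a decreasing step size α_t. The sketch below is only an illustration of that idea, not the paper's algorithm or analysis: the toy MDP, the specific schedules (α_t = 1/√t, γ_t = 1 − 1/√t), and the truncated REINFORCE-style estimator are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP used only for illustration.
# P[s, a] gives the next state; R[s, a] gives the reward.
P = np.array([[0, 1], [1, 0]])
R = np.array([[0.0, 1.0], [0.0, 0.0]])
n_states, n_actions = R.shape
H = 20  # truncated episode horizon

theta = np.zeros((n_states, n_actions))  # tabular softmax policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(1, 2001):
    # Coupled schedules (illustrative assumption): the step size decays
    # while the discount factor rises toward 1 at a related rate.
    alpha_t = 1.0 / np.sqrt(t)
    gamma_t = 1.0 - 1.0 / np.sqrt(t)

    # Sample one truncated episode under the current policy.
    s, trajectory = 0, []
    for _ in range(H):
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)
        trajectory.append((s, a, R[s, a]))
        s = P[s, a]

    # Discounted approximation of the policy gradient: weight each
    # grad log pi(a_k | s_k) by the discounted return from step k,
    # with states weighted by undiscounted visitation.
    grad = np.zeros_like(theta)
    rewards = [r for (_, _, r) in trajectory]
    for k, (s_k, a_k, _) in enumerate(trajectory):
        G_k = sum(gamma_t ** i * r for i, r in enumerate(rewards[k:]))
        probs = softmax(theta[s_k])
        dlog = -probs          # gradient of log softmax w.r.t. theta[s_k]
        dlog[a_k] += 1.0
        grad[s_k] += G_k * dlog

    theta += alpha_t * grad  # ascent step with decaying step size
```

Under this kind of coupling, each update is a biased estimate of the undiscounted policy gradient whose bias shrinks as γ_t approaches 1, which is the intuition behind recovering the standard guarantees of gradient ascent on the undiscounted objective; the precise schedule and conditions are given in the paper, not here.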
