Paper Title

Temporal Difference Learning as Gradient Splitting

Paper Authors

Rui Liu, Alex Olshevsky

Paper Abstract

Temporal difference learning with linear function approximation is a popular method to obtain a low-dimensional approximation of the value function of a policy in a Markov Decision Process. We give a new interpretation of this method in terms of a splitting of the gradient of an appropriately chosen function. As a consequence of this interpretation, convergence proofs for gradient descent can be applied almost verbatim to temporal difference learning. Beyond giving a new, fuller explanation of why temporal difference learning works, our interpretation also yields improved convergence times. We consider the setting with a $1/\sqrt{T}$ step-size, where previous comparable finite-time convergence bounds for temporal difference learning had the multiplicative factor $1/(1-\gamma)$ in front of the bound, with $\gamma$ being the discount factor. We show that a minor variation on TD learning which estimates the mean of the value function separately has a convergence time where $1/(1-\gamma)$ only multiplies an asymptotically negligible term.
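
To make the object of study concrete, the following is a minimal, illustrative sketch of TD(0) with linear function approximation, run with the constant $1/\sqrt{T}$ step-size discussed above and returning the averaged iterate, as is standard for that step-size choice. The environment, features, and helper names (`env_step`, `phi`) are assumptions made for this example, not details from the paper; in particular, the paper's actual variant additionally estimates the mean of the value function separately, which is not implemented here.

```python
import numpy as np

def td0_linear(env_step, phi, theta0, gamma, T):
    """TD(0) with linear function approximation: V(s) is approximated by phi(s) @ theta.

    env_step(s) samples a reward and next state from the Markov chain
    induced by the evaluated policy. Uses a constant 1/sqrt(T) step-size
    and returns the running average of the iterates.
    """
    theta = theta0.astype(float)
    theta_bar = theta.copy()          # running average of the iterates
    s = 0                             # assumed integer initial state
    alpha = 1.0 / np.sqrt(T)          # constant 1/sqrt(T) step-size
    for t in range(1, T + 1):
        r, s_next = env_step(s)
        # TD error: bootstrapped one-step target minus current estimate
        delta = r + gamma * (phi(s_next) @ theta) - phi(s) @ theta
        theta = theta + alpha * delta * phi(s)   # semi-gradient TD(0) update
        theta_bar += (theta - theta_bar) / (t + 1)
        s = s_next
    return theta_bar

# Toy usage: a 5-state random walk on a cycle, reward 1 for entering state 4.
n = 5

def env_step(s):
    s_next = (s + (1 if np.random.rand() < 0.5 else -1)) % n
    return (1.0 if s_next == 4 else 0.0), s_next

phi = lambda s: np.eye(n)[s]          # tabular one-hot features
theta = td0_linear(env_step, phi, np.zeros(n), gamma=0.9, T=50_000)
print(theta)
```

In the toy usage, the one-hot features make the linear approximation exact, so the averaged iterate should approach the true discounted state values of the random walk.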
