Paper Title

Temporal Difference Learning as Gradient Splitting

Paper Authors

Rui Liu, Alex Olshevsky

Paper Abstract

Temporal difference learning with linear function approximation is a popular method to obtain a low-dimensional approximation of the value function of a policy in a Markov Decision Process. We give a new interpretation of this method in terms of a splitting of the gradient of an appropriately chosen function. As a consequence of this interpretation, convergence proofs for gradient descent can be applied almost verbatim to temporal difference learning. Beyond giving a new, fuller explanation of why temporal difference learning works, our interpretation also yields improved convergence times. We consider the setting with a $1/\sqrt{T}$ step-size, where previous comparable finite-time convergence bounds for temporal difference learning had the multiplicative factor $1/(1-\gamma)$ in front of the bound, with $\gamma$ being the discount factor. We show that a minor variation on TD learning which estimates the mean of the value function separately has a convergence time where $1/(1-\gamma)$ only multiplies an asymptotically negligible term.
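
To make the object of study concrete, the following is a minimal, illustrative sketch of TD(0) with linear function approximation, run with the constant $1/\sqrt{T}$ step-size discussed above and returning the averaged iterate, as is standard for that step-size choice. The environment, features, and helper names (`env_step`, `phi`) are assumptions made for this example, not details from the paper; in particular, the paper's actual variant additionally estimates the mean of the value function separately, which is not implemented here.

```python
import numpy as np

def td0_linear(env_step, phi, theta0, gamma, T):
    """TD(0) with linear function approximation: V(s) is approximated by phi(s) @ theta.

    env_step(s) samples a reward and next state from the Markov chain
    induced by the evaluated policy. Uses a constant 1/sqrt(T) step-size
    and returns the running average of the iterates.
    """
    theta = theta0.astype(float)
    theta_bar = theta.copy()          # running average of the iterates
    s = 0                             # assumed integer initial state
    alpha = 1.0 / np.sqrt(T)          # constant 1/sqrt(T) step-size
    for t in range(1, T + 1):
        r, s_next = env_step(s)
        # TD error: bootstrapped one-step target minus current estimate
        delta = r + gamma * (phi(s_next) @ theta) - phi(s) @ theta
        theta = theta + alpha * delta * phi(s)   # semi-gradient TD(0) update
        theta_bar += (theta - theta_bar) / (t + 1)
        s = s_next
    return theta_bar

# Toy usage: a 5-state random walk on a cycle, reward 1 for entering state 4.
n = 5

def env_step(s):
    s_next = (s + (1 if np.random.rand() < 0.5 else -1)) % n
    return (1.0 if s_next == 4 else 0.0), s_next

phi = lambda s: np.eye(n)[s]          # tabular one-hot features
theta = td0_linear(env_step, phi, np.zeros(n), gamma=0.9, T=50_000)
print(theta)
```

In the toy usage, the one-hot features make the linear approximation exact, so the averaged iterate should approach the true discounted state values of the random walk.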
