Paper Title
Gradient Descent Temporal Difference-difference Learning
Paper Authors
Paper Abstract
Off-policy algorithms, in which a behavior policy differs from the target policy and is used to gain experience for learning, have proven to be of great practical value in reinforcement learning. However, even for simple convex problems such as linear value function approximation, these algorithms are not guaranteed to be stable. To address this, alternative algorithms that are provably convergent in such cases have been introduced, the most well known being gradient descent temporal difference (GTD) learning. This algorithm and others like it, however, tend to converge much more slowly than conventional temporal difference learning. In this paper we propose gradient descent temporal difference-difference (Gradient-DD) learning in order to improve GTD2, a GTD algorithm, by introducing second-order differences in successive parameter updates. We investigate this algorithm in the framework of linear value function approximation, theoretically proving its convergence by applying the theory of stochastic approximation. Studying the model empirically on the random walk task, the Boyan-chain task, and Baird's off-policy counterexample, we find substantial improvement over GTD2 and, in several cases, even better performance than conventional TD learning.
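
As a rough illustration of the idea summarized above, the sketch below runs GTD2 with linear (tabular) value function approximation on a small random walk and adds a penalty on the difference between successive parameter updates, i.e. a second-order difference term. The toy environment, the penalty weight `kappa`, and the exact form of the extra term are assumptions made for illustration; the paper's precise Gradient-DD update may differ.

```python
import numpy as np

# Minimal sketch: GTD2 with an added second-order-difference term
# ("Gradient-DD"-style, illustrative only). The environment, the constant
# `kappa`, and the exact form of the extra term are assumptions, not the
# paper's exact update rule.

def random_walk_episode(n_states=5):
    """Yield (state, reward, next_state) for a simple random walk.

    Terminal on the left gives reward 0, on the right reward 1;
    next_state is None at termination.
    """
    s = n_states // 2
    while True:
        s_next = s + np.random.choice([-1, 1])
        if s_next < 0:
            yield s, 0.0, None
            return
        if s_next >= n_states:
            yield s, 1.0, None
            return
        yield s, 0.0, s_next
        s = s_next

def features(s, n_states=5):
    """One-hot (tabular) features; a terminal state (None) maps to zeros."""
    x = np.zeros(n_states)
    if s is not None:
        x[s] = 1.0
    return x

def gradient_dd_sketch(n_episodes=500, n_states=5, gamma=1.0,
                       alpha=0.05, beta=0.05, kappa=0.1):
    theta = np.zeros(n_states)      # value-function weights
    w = np.zeros(n_states)          # auxiliary GTD2 weights
    theta_prev = theta.copy()       # previous weights for the difference term
    for _ in range(n_episodes):
        for s, r, s_next in random_walk_episode(n_states):
            x, x_next = features(s, n_states), features(s_next, n_states)
            delta = r + gamma * theta @ x_next - theta @ x   # TD error
            # Standard GTD2 update for theta
            new_theta = theta + alpha * (x - gamma * x_next) * (x @ w)
            # Second-order difference term: penalize large changes between
            # successive parameter vectors (illustrative form, weight kappa)
            new_theta -= kappa * alpha * (theta - theta_prev)
            # Standard GTD2 update for the auxiliary weights
            w = w + beta * (delta - x @ w) * x
            theta_prev, theta = theta, new_theta
    return theta

if __name__ == "__main__":
    # Approximate state values; for this walk they should tend toward
    # roughly [1/6, 2/6, 3/6, 4/6, 5/6].
    print(gradient_dd_sketch())
```

Setting `kappa=0` recovers plain GTD2 in this sketch, which makes it easy to compare the two updates on the same stream of experience.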