Paper Title

Gradient Temporal-Difference Learning with Regularized Corrections

Authors

Sina Ghiassian, Andrew Patterson, Shivam Garg, Dhawal Gupta, Adam White, Martha White

Abstract

It is still common to use Q-learning and temporal difference (TD) learning, even though they have divergence issues and sound Gradient TD alternatives exist, because divergence seems rare and they typically perform well. However, recent work with large neural network learning systems reveals that instability is more common than previously thought. Practitioners face a difficult dilemma: choose an easy-to-use, performant TD method, or a more complex algorithm that is more sound but harder to tune and all but unexplored with non-linear function approximation or control. In this paper, we introduce a new method called TD with Regularized Corrections (TDRC) that attempts to balance ease of use, soundness, and performance. It behaves as well as TD when TD performs well, but is sound in cases where TD diverges. We empirically investigate TDRC across a range of problems, for both prediction and control, and for both linear and non-linear function approximation, and show, potentially for the first time, that gradient TD methods could be a better alternative to TD and Q-learning.
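
The abstract does not spell out the update rule, but the method it names, TDRC, is described in the paper as the TDC (gradient TD) update with an l2 regularizer on the secondary weight vector. The sketch below illustrates that idea for on-policy linear prediction; the class name, step size, and the fixed regularization strength beta = 1 are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

class TDRC:
    """Sketch of TD with Regularized Corrections (TDRC) for on-policy
    linear prediction: the TDC update plus an l2 penalty on the secondary
    weight vector h (strength beta). Names and defaults are illustrative."""

    def __init__(self, n_features, alpha=0.005, beta=1.0, gamma=0.99):
        self.w = np.zeros(n_features)  # value-function weights
        self.h = np.zeros(n_features)  # secondary (correction) weights
        self.alpha = alpha             # single shared step size
        self.beta = beta               # regularization strength on h
        self.gamma = gamma             # discount factor

    def update(self, phi, reward, phi_next):
        """Apply one TDRC update for the transition (phi, reward, phi_next)."""
        # Standard TD error under the current value estimate
        delta = reward + self.gamma * (self.w @ phi_next) - self.w @ phi
        # Primary update: the TD step plus TDC's gradient-correction term
        self.w += self.alpha * (delta * phi
                                - self.gamma * (self.h @ phi) * phi_next)
        # Secondary update: TDC's estimate of the expected TD error, with an
        # added l2 penalty on h; this penalty is the "regularized correction"
        # that keeps the method close to plain TD whenever TD is stable
        self.h += self.alpha * ((delta - self.h @ phi) * phi
                                - self.beta * self.h)
        return delta
```

Under this reading, setting beta = 0 recovers TDC, while a fixed beta and a single shared step size remove the second tuning parameter that makes standard gradient TD methods harder to use in practice.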
