Paper Title
Self-correcting Q-Learning
Paper Authors
Paper Abstract
The Q-learning algorithm is known to be affected by the maximization bias, i.e. the systematic overestimation of action values, an important issue that has recently received renewed attention. Double Q-learning has been proposed as an efficient algorithm to mitigate this bias. However, this comes at the price of an underestimation of action values, in addition to increased memory requirements and a slower convergence. In this paper, we introduce a new way to address the maximization bias in the form of a "self-correcting algorithm" for approximating the maximum of an expected value. Our method balances the overestimation of the single estimator used in conventional Q-learning and the underestimation of the double estimator used in Double Q-learning. Applying this strategy to Q-learning results in Self-correcting Q-learning. We show theoretically that this new algorithm enjoys the same convergence guarantees as Q-learning while being more accurate. Empirically, it performs better than Double Q-learning in domains with rewards of high variance, and it even attains faster convergence than Q-learning in domains with rewards of zero or low variance. These advantages transfer to a Deep Q Network implementation that we call Self-correcting DQN and which outperforms regular DQN and Double DQN on several tasks in the Atari 2600 domain.
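The abstract contrasts the upward bias of the single estimator used in Q-learning with the downward bias of the double estimator used in Double Q-learning when approximating the maximum of expected values. The sketch below is a minimal numerical illustration of that trade-off only; it is not the paper's self-correcting estimator, and the action means, sample counts, and noise scale (true_means, n_samples, and so on) are arbitrary assumptions chosen for the demonstration.

```python
import numpy as np

# Minimal sketch of the bias trade-off named in the abstract: estimating
# max_a E[X_a] from noisy samples. All numbers below are illustrative
# assumptions, not values from the paper.
rng = np.random.default_rng(0)
n_actions, n_samples, n_trials = 10, 20, 20000
true_means = np.linspace(-1.0, 0.0, n_actions)  # true max of expected values is 0.0

single_est, double_est = [], []
for _ in range(n_trials):
    samples = true_means[:, None] + rng.normal(scale=1.0, size=(n_actions, n_samples))

    # Single estimator (as in conventional Q-learning): max over sample means.
    # Taking a max over noisy estimates is systematically biased upward.
    single_est.append(samples.mean(axis=1).max())

    # Double estimator (as in Double Q-learning): select the argmax with one
    # half of the data and evaluate it with the other half. The upward bias
    # disappears, but the result tends to be biased downward instead.
    half = n_samples // 2
    means_a = samples[:, :half].mean(axis=1)
    means_b = samples[:, half:].mean(axis=1)
    double_est.append(means_b[means_a.argmax()])

print("true max of expected values:        0.000")
print(f"single estimator (overestimates):  {np.mean(single_est):+.3f}")
print(f"double estimator (underestimates): {np.mean(double_est):+.3f}")
```

Running this shows the single estimator settling above zero and the double estimator below it; the paper's self-correcting estimator is described as balancing these two biases, but its exact update rule is not given in this abstract.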