Paper Title
Finite-Time Analysis for Double Q-learning
Paper Authors
Paper Abstract
Although Q-learning is one of the most successful algorithms for finding the best action-value function (and thus the optimal policy) in reinforcement learning, its implementation often suffers from large overestimation of Q-function values incurred by random sampling. The double Q-learning algorithm proposed in~\citet{hasselt2010double} overcomes such an overestimation issue by randomly switching the update between two Q-estimators, and has thus gained significant popularity in practice. However, the theoretical understanding of double Q-learning is rather limited. So far only the asymptotic convergence has been established, which does not characterize how fast the algorithm converges. In this paper, we provide the first non-asymptotic (i.e., finite-time) analysis for double Q-learning. We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum by taking $\tilde{\Omega}\left(\left( \frac{1}{(1-\gamma)^6\epsilon^2}\right)^{\frac{1}{\omega}} +\left(\frac{1}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right)$ iterations, where $\omega\in(0,1)$ is the decay parameter of the learning rate, and $\gamma$ is the discount factor. Our analysis develops novel techniques to derive finite-time bounds on the difference between two inter-connected stochastic processes, which is new to the literature of stochastic approximation.
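For concreteness, the following is a minimal Python sketch of the tabular double Q-learning update described above (two Q-estimators with a random coin flip deciding which one is updated), using a polynomially decaying step size $\alpha_t = 1/t^{\omega}$ that matches the decay parameter $\omega$ in the iteration bound. The environment interface (`env.reset()` / `env.step()`), the greedy behavior policy, and all hyperparameter values are illustrative assumptions, not specifics from the paper.

```python
import numpy as np

def double_q_learning(env, num_states, num_actions,
                      gamma=0.99, omega=0.8, num_steps=100_000):
    """Tabular double Q-learning sketch with step size alpha_t = 1 / t**omega.

    Assumes env.reset() returns an integer state and env.step(action) returns
    (next_state, reward, done); this interface is hypothetical.
    """
    q_a = np.zeros((num_states, num_actions))
    q_b = np.zeros((num_states, num_actions))
    state = env.reset()
    for t in range(1, num_steps + 1):
        alpha = 1.0 / t**omega  # decaying learning rate, omega in (0, 1)
        # Behavior policy: greedy w.r.t. the sum of the two estimators
        # (exploration, e.g. epsilon-greedy, would normally be added here).
        action = int(np.argmax(q_a[state] + q_b[state]))
        next_state, reward, done = env.step(action)
        if np.random.rand() < 0.5:
            # Update estimator A: select the argmax with A, evaluate it with B.
            best = int(np.argmax(q_a[next_state]))
            target = reward + (0.0 if done else gamma * q_b[next_state, best])
            q_a[state, action] += alpha * (target - q_a[state, action])
        else:
            # Update estimator B: select the argmax with B, evaluate it with A.
            best = int(np.argmax(q_b[next_state]))
            target = reward + (0.0 if done else gamma * q_a[next_state, best])
            q_b[state, action] += alpha * (target - q_b[state, action])
        state = env.reset() if done else next_state
    return q_a, q_b
```

Decoupling action selection (one estimator) from action evaluation (the other estimator) is what mitigates the overestimation bias of standard Q-learning; the coin flip makes the two estimators evolve as the inter-connected stochastic processes analyzed in the paper.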