Paper Title
Finite-Time Analysis for Double Q-learning
Paper Authors
Paper Abstract
Although Q-learning is one of the most successful algorithms for finding the best action-value function (and thus the optimal policy) in reinforcement learning, its implementation often suffers from large overestimation of Q-function values incurred by random sampling. The double Q-learning algorithm proposed in~\citet{hasselt2010double} overcomes such an overestimation issue by randomly switching the update between two Q-estimators, and has thus gained significant popularity in practice. However, the theoretical understanding of double Q-learning is rather limited. So far only the asymptotic convergence has been established, which does not characterize how fast the algorithm converges. In this paper, we provide the first non-asymptotic (i.e., finite-time) analysis for double Q-learning. We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum by taking $\tilde{\Omega}\left(\left( \frac{1}{(1-\gamma)^6\epsilon^2}\right)^{\frac{1}{\omega}} +\left(\frac{1}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right)$ iterations, where $\omega\in(0,1)$ is the decay parameter of the learning rate, and $\gamma$ is the discount factor. Our analysis develops novel techniques to derive finite-time bounds on the difference between two inter-connected stochastic processes, which is new to the literature of stochastic approximation.
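For concreteness, the following is a minimal Python sketch of the tabular double Q-learning update described above (two Q-estimators with a random coin flip deciding which one is updated), using a polynomially decaying step size $\alpha_t = 1/t^{\omega}$ that matches the decay parameter $\omega$ in the iteration bound. The environment interface (`env.reset()` / `env.step()`), the greedy behavior policy, and all hyperparameter values are illustrative assumptions, not specifics from the paper.

```python
import numpy as np

def double_q_learning(env, num_states, num_actions,
                      gamma=0.99, omega=0.8, num_steps=100_000):
    """Tabular double Q-learning sketch with step size alpha_t = 1 / t**omega.

    Assumes env.reset() returns an integer state and env.step(action) returns
    (next_state, reward, done); this interface is hypothetical.
    """
    q_a = np.zeros((num_states, num_actions))
    q_b = np.zeros((num_states, num_actions))
    state = env.reset()
    for t in range(1, num_steps + 1):
        alpha = 1.0 / t**omega  # decaying learning rate, omega in (0, 1)
        # Behavior policy: greedy w.r.t. the sum of the two estimators
        # (exploration, e.g. epsilon-greedy, would normally be added here).
        action = int(np.argmax(q_a[state] + q_b[state]))
        next_state, reward, done = env.step(action)
        if np.random.rand() < 0.5:
            # Update estimator A: select the argmax with A, evaluate it with B.
            best = int(np.argmax(q_a[next_state]))
            target = reward + (0.0 if done else gamma * q_b[next_state, best])
            q_a[state, action] += alpha * (target - q_a[state, action])
        else:
            # Update estimator B: select the argmax with B, evaluate it with A.
            best = int(np.argmax(q_b[next_state]))
            target = reward + (0.0 if done else gamma * q_a[next_state, best])
            q_b[state, action] += alpha * (target - q_b[state, action])
        state = env.reset() if done else next_state
    return q_a, q_b
```

Decoupling action selection (one estimator) from action evaluation (the other estimator) is what mitigates the overestimation bias of standard Q-learning; the coin flip makes the two estimators evolve as the inter-connected stochastic processes analyzed in the paper.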