Paper Title
Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning
Paper Authors
Paper Abstract
Temporal-Difference (TD) learning with nonlinear smooth function approximation for policy evaluation has achieved great success in modern reinforcement learning. It has been shown that such a problem can be reformulated as a stochastic nonconvex-strongly-concave optimization problem, which is challenging because the naive stochastic gradient descent-ascent algorithm suffers from slow convergence. Existing approaches for this problem are based on two-timescale or double-loop stochastic gradient algorithms, which may also require sampling large batches of data. In practice, however, a single-timescale single-loop stochastic algorithm is preferred for its simplicity and because its step-sizes are easier to tune. In this paper, we propose two single-timescale single-loop algorithms, each of which requires only one data point per step. Our first algorithm applies momentum updates to both the primal and dual variables, achieving an $O(\varepsilon^{-4})$ sample complexity; this highlights the important role of momentum in obtaining a single-timescale algorithm. Our second algorithm improves upon the first by applying variance reduction on top of momentum, matching the best known $O(\varepsilon^{-3})$ sample complexity in existing works. Furthermore, our variance-reduction algorithm does not require a large-batch checkpoint. Moreover, our theoretical results for both algorithms are expressed in the tighter form of simultaneous convergence on both the primal and dual sides.
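The abstract's first algorithm — single-timescale stochastic gradient descent-ascent with momentum on both the primal and dual variables — can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the paper's actual algorithm or TD objective: it uses a scalar test function $f(x, y) = \sin(x)\,y - \tfrac{1}{2}y^2$ (nonconvex in $x$, strongly concave in $y$), simulates stochastic gradients with Gaussian noise, and uses one sample per step with step-sizes of the same order on both sides.

```python
import numpy as np

# Toy nonconvex-strongly-concave objective (illustrative only):
#   f(x, y) = sin(x) * y - 0.5 * y**2
# It is strongly concave in y with best response y*(x) = sin(x),
# and the primal function max_y f(x, y) = 0.5 * sin(x)**2 is nonconvex in x.
rng = np.random.default_rng(0)

def grad_x(x, y):
    # Noisy partial derivative df/dx; noise simulates a stochastic sample.
    return np.cos(x) * y + 0.1 * rng.standard_normal()

def grad_y(x, y):
    # Noisy partial derivative df/dy.
    return np.sin(x) - y + 0.1 * rng.standard_normal()

def momentum_gda(steps=20000, eta_x=0.01, eta_y=0.05, beta=0.9):
    """Single-timescale single-loop GDA with momentum on both variables.

    One stochastic gradient sample per step; eta_x and eta_y are of the
    same order (single timescale), so neither loop waits for the other.
    """
    x, y, mx, my = 2.0, 0.0, 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)   # one data point per step
        mx = beta * mx + (1 - beta) * gx      # momentum on the primal side
        my = beta * my + (1 - beta) * gy      # momentum on the dual side
        x -= eta_x * mx                       # descent on the primal variable
        y += eta_y * my                       # ascent on the dual variable
    return x, y

x, y = momentum_gda()
# Near a stationary point, y should track the best response sin(x),
# and x should approach a stationary point of 0.5 * sin(x)**2.
print(x, y)
```

The momentum averaging is what lets a single-timescale scheme tolerate one-sample gradient noise; the paper's second algorithm would additionally correct each gradient estimate with a variance-reduction (recursive, STORM-style) term, which is omitted here for brevity.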