Paper Title
Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning
Paper Authors
Paper Abstract
Temporal-Difference (TD) learning with nonlinear smooth function approximation for policy evaluation has achieved great success in modern reinforcement learning. It has been shown that such a problem can be reformulated as a stochastic nonconvex-strongly-concave optimization problem, which is challenging because the naive stochastic gradient descent-ascent algorithm suffers from slow convergence. Existing approaches for this problem are based on two-timescale or double-loop stochastic gradient algorithms, which may also require sampling large batches of data. In practice, however, a single-timescale single-loop stochastic algorithm is preferred for its simplicity and because its step-sizes are easier to tune. In this paper, we propose two single-timescale single-loop algorithms, each of which requires only one data point per step. Our first algorithm applies momentum updates to both the primal and dual variables, achieving an $O(\varepsilon^{-4})$ sample complexity; this highlights the important role of momentum in obtaining a single-timescale algorithm. Our second algorithm improves upon the first by applying variance reduction on top of momentum, matching the best known $O(\varepsilon^{-3})$ sample complexity in existing works. Furthermore, our variance-reduction algorithm does not require a large-batch checkpoint. Moreover, our theoretical results for both algorithms are expressed in the tighter form of simultaneous convergence on both the primal and dual sides.
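The abstract's first algorithm — single-timescale stochastic gradient descent-ascent with momentum on both the primal and dual variables — can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the paper's actual algorithm or TD objective: it uses a scalar test function $f(x, y) = \sin(x)\,y - \tfrac{1}{2}y^2$ (nonconvex in $x$, strongly concave in $y$), simulates stochastic gradients with Gaussian noise, and uses one sample per step with step-sizes of the same order on both sides.

```python
import numpy as np

# Toy nonconvex-strongly-concave objective (illustrative only):
#   f(x, y) = sin(x) * y - 0.5 * y**2
# It is strongly concave in y with best response y*(x) = sin(x),
# and the primal function max_y f(x, y) = 0.5 * sin(x)**2 is nonconvex in x.
rng = np.random.default_rng(0)

def grad_x(x, y):
    # Noisy partial derivative df/dx; noise simulates a stochastic sample.
    return np.cos(x) * y + 0.1 * rng.standard_normal()

def grad_y(x, y):
    # Noisy partial derivative df/dy.
    return np.sin(x) - y + 0.1 * rng.standard_normal()

def momentum_gda(steps=20000, eta_x=0.01, eta_y=0.05, beta=0.9):
    """Single-timescale single-loop GDA with momentum on both variables.

    One stochastic gradient sample per step; eta_x and eta_y are of the
    same order (single timescale), so neither loop waits for the other.
    """
    x, y, mx, my = 2.0, 0.0, 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)   # one data point per step
        mx = beta * mx + (1 - beta) * gx      # momentum on the primal side
        my = beta * my + (1 - beta) * gy      # momentum on the dual side
        x -= eta_x * mx                       # descent on the primal variable
        y += eta_y * my                       # ascent on the dual variable
    return x, y

x, y = momentum_gda()
# Near a stationary point, y should track the best response sin(x),
# and x should approach a stationary point of 0.5 * sin(x)**2.
print(x, y)
```

The momentum averaging is what lets a single-timescale scheme tolerate one-sample gradient noise; the paper's second algorithm would additionally correct each gradient estimate with a variance-reduction (recursive, STORM-style) term, which is omitted here for brevity.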