Title

Accelerating Optimization and Reinforcement Learning with Quasi-Stochastic Approximation

Authors

Shuhang Chen, Adithya Devraj, Andrey Bernstein, Sean Meyn

Abstract

The ODE method has been a workhorse for algorithm design and analysis since the introduction of stochastic approximation. It is now understood that convergence theory amounts to establishing robustness of Euler approximations for ODEs, while the theory of rates of convergence requires finer analysis. This paper sets out to extend this theory to quasi-stochastic approximation, based on algorithms in which the "noise" is generated by deterministic signals. The main results are obtained under minimal assumptions: the usual Lipschitz conditions for ODE vector fields, and the existence of a well-defined linearization near the optimal parameter $\theta^*$ with Hurwitz linearization matrix $A^*$. The main contributions are summarized as follows:

(i) If the algorithm gain is $a_t = g/(1+t)^\rho$ with $g > 0$ and $\rho \in (0,1)$, then the rate of convergence of the algorithm is $1/t^\rho$. There is also a well-defined "finite-$t$" approximation
\[
a_t^{-1}\{\Theta_t - \theta^*\} = \bar{Y} + \Xi^{\mathrm{I}}_t + o(1),
\]
where $\bar{Y} \in \mathbb{R}^d$ is a vector identified in the paper and $\{\Xi^{\mathrm{I}}_t\}$ is bounded with zero temporal mean.

(ii) With gain $a_t = g/(1+t)$ the results are not as sharp: the rate of convergence $1/t$ holds only if $I + g A^*$ is Hurwitz.

(iii) Based on the Ruppert-Polyak averaging technique of stochastic approximation, one would expect that a convergence rate of $1/t$ can be obtained by averaging:
\[
\Theta^{\text{RP}}_T = \frac{1}{T}\int_{0}^{T} \Theta_t\,dt,
\]
where the estimates $\{\Theta_t\}$ are obtained using the gain in (i). The preceding sharp bounds imply that averaging results in a $1/t$ convergence rate if and only if $\bar{Y} = 0$. This condition holds if the noise is additive, but appears to fail in general.

(iv) The theory is illustrated with applications to gradient-free optimization and policy gradient algorithms for reinforcement learning.
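To make the recursion concrete, below is a minimal Python sketch of a quasi-stochastic approximation loop for gradient-free optimization, combining the vanishing gain $a_t = g/(1+t)^\rho$ from item (i), a zero-mean sinusoidal probing signal in place of stochastic noise, and the Ruppert-Polyak time average from item (iii). The function `qsa_gradient_free`, the probing-signal construction, and all constants (`g`, `rho`, `eps`, the toy quadratic loss) are illustrative assumptions, not the paper's exact algorithm or tuning.

```python
import numpy as np

def qsa_gradient_free(loss, theta0, T=200.0, dt=1e-3, g=1.0, rho=0.8, eps=0.1):
    """Illustrative quasi-stochastic approximation sketch (not the paper's exact scheme).

    Euler discretization of  d/dt Theta_t = a_t * f(Theta_t, xi_t)  with gain
    a_t = g/(1+t)^rho and a zero-mean deterministic probing signal xi_t.
    """
    d = len(theta0)
    freqs = 1.0 + np.arange(d)                  # distinct probing frequencies
    theta = np.asarray(theta0, dtype=float).copy()
    theta_rp = np.zeros(d)                      # Ruppert-Polyak running time average
    t = 0.0
    for _ in range(int(T / dt)):
        a_t = g / (1.0 + t) ** rho              # vanishing gain, rho in (0, 1)
        xi = np.sqrt(2.0) * np.cos(freqs * t)   # deterministic "noise"; time avg of xi xi^T ~ I
        # gradient-free estimate: probing direction weighted by the perturbed loss;
        # its time average approximates -grad loss(theta)
        f_val = -xi * loss(theta + eps * xi) / eps
        theta = theta + dt * a_t * f_val        # Euler step of the QSA ODE
        t += dt
        theta_rp += (theta - theta_rp) * dt / t # running average (1/t) * integral of Theta_s ds
    return theta, theta_rp

# usage on a toy quadratic with minimizer theta* = (1, -2)
loss = lambda x: 0.5 * ((x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2)
theta_T, theta_RP = qsa_gradient_free(loss, theta0=[0.0, 0.0])
print("final estimate:", theta_T, "averaged estimate:", theta_RP)
```

Comparing `theta_T` (the raw final iterate) with `theta_RP` (its time average) gives a rough numerical illustration of item (iii): under the sharp bounds above, averaging improves the rate to $1/t$ only when the bias vector $\bar{Y}$ vanishes.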
