Paper Title

On the Global Convergence Rates of Softmax Policy Gradient Methods

Paper Authors

Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, Dale Schuurmans

Paper Abstract

We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at an $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a Łojasiewicz inequality, and that the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate $O(e^{-c \cdot t})$, where $c > 0$, toward the softmax optimal policy. This result resolves an open question in the recent literature. Finally, combining the above two results with additional new $\Omega(1/t)$ lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform Łojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.
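To make the two updates compared in the abstract concrete, here is a minimal Python sketch (not from the paper) of exact-gradient softmax policy gradient on a single-state bandit, both plain and entropy regularized; the reward vector `r`, step size `eta`, and temperature `tau` are hypothetical illustrative choices.

```python
# A minimal sketch (not from the paper): exact-gradient softmax policy gradient
# on a single-state bandit, with and without entropy regularization.
# The reward vector r, step size eta, and temperature tau are assumed values.
import numpy as np

r = np.array([1.0, 0.9, 0.1])   # hypothetical per-action rewards
eta = 0.4                        # step size (assumed)
tau = 0.1                        # entropy regularization strength (assumed)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta_pg = np.zeros_like(r)      # vanilla softmax policy gradient
theta_ent = np.zeros_like(r)     # entropy-regularized variant

for t in range(500):
    # True policy gradient: d/dtheta_a E_pi[r] = pi_a * (r_a - pi^T r).
    pi = softmax(theta_pg)
    theta_pg = theta_pg + eta * pi * (r - pi @ r)

    # Entropy-regularized objective pi^T r + tau * H(pi): its gradient has the
    # same form with r replaced by r - tau * log(pi).
    pi = softmax(theta_ent)
    r_reg = r - tau * np.log(pi)
    theta_ent = theta_ent + eta * pi * (r_reg - pi @ r_reg)

# The vanilla run approaches the greedy optimum at an O(1/t)-type rate; the
# regularized run converges much faster, but toward the (slightly biased)
# softmax optimal policy rather than the greedy one.
print("vanilla policy:", softmax(theta_pg),
      "suboptimality:", r.max() - softmax(theta_pg) @ r)
print("entropy policy:", softmax(theta_ent),
      "suboptimality:", r.max() - softmax(theta_ent) @ r)
```

Running the sketch for more iterations makes the contrast described in the abstract visible: the unregularized suboptimality shrinks roughly like $1/t$, while the regularized iterates settle quickly onto the softmax optimal policy induced by the chosen temperature.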
