Paper Title

Sample Efficient Reinforcement Learning with REINFORCE

Authors

Junzi Zhang, Jongho Kim, Brendan O'Donoghue, Stephen Boyd

Abstract

Policy gradient methods are among the most effective methods for large-scale reinforcement learning, and their empirical success has prompted several works that develop the foundation of their global convergence theory. However, prior works have either required exact gradients or state-action visitation measure based mini-batch stochastic gradients with a diverging batch size, which limit their applicability in practical scenarios. In this paper, we consider classical policy gradient methods that compute an approximate gradient with a single trajectory or a fixed size mini-batch of trajectories under soft-max parametrization and log-barrier regularization, along with the widely-used REINFORCE gradient estimation procedure. By controlling the number of "bad" episodes and resorting to the classical doubling trick, we establish an anytime sub-linear high probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate. These provide the first set of global convergence and sample efficiency results for the well-known REINFORCE algorithm and contribute to a better understanding of its performance in practice.
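
The abstract describes tabular soft-max policies updated with single-trajectory REINFORCE gradient estimates under log-barrier regularization. The following is a minimal sketch of that setting, assuming a toy random MDP, arbitrary constants (S, A, H, GAMMA, LAM, LR), and the common placement of the barrier coefficient as lambda/(|S||A|); it illustrates the estimator, not the paper's exact algorithm, step sizes, or regret analysis.

```python
# Sketch: REINFORCE with soft-max (tabular) parametrization and log-barrier
# regularization, using a single sampled trajectory per gradient step.
# The MDP, horizon, and all constants below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
S, A, H, GAMMA, LAM, LR = 5, 3, 20, 0.95, 0.01, 0.1

# Toy MDP: P[s, a] is a distribution over next states, R[s, a] in [0, 1].
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.random((S, A))

theta = np.zeros((S, A))  # soft-max policy parameters


def policy(theta):
    """pi(a|s) = exp(theta[s, a]) / sum_a' exp(theta[s, a'])."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def sample_trajectory(pi):
    """Roll out a single episode of length H from a fixed initial state."""
    s, traj = 0, []
    for _ in range(H):
        a = rng.choice(A, p=pi[s])
        traj.append((s, a, R[s, a]))
        s = rng.choice(S, p=P[s, a])
    return traj


def reinforce_gradient(pi, traj):
    """Single-trajectory REINFORCE estimate of the policy gradient, plus the
    exact gradient of the log-barrier regularizer (lam/(S*A)) * sum log pi."""
    grad = np.zeros_like(pi)
    for t, (s, a, _) in enumerate(traj):
        # Discounted return from time t onward.
        G = sum((GAMMA ** (k - t)) * r
                for k, (_, _, r) in enumerate(traj[t:], start=t))
        glog = -pi[s].copy()
        glog[a] += 1.0            # grad_theta[s] log pi(a|s) = e_a - pi(.|s)
        grad[s] += (GAMMA ** t) * G * glog
    # Gradient of the log-barrier term with respect to theta.
    grad += LAM / (S * A) * (1.0 - A * pi)
    return grad


for _ in range(200):
    pi = policy(theta)
    traj = sample_trajectory(pi)
    theta += LR * reinforce_gradient(pi, traj)  # stochastic gradient ascent
```

The key point mirrored from the abstract is that each update uses only one sampled trajectory (a fixed-size mini-batch would average several such estimates), rather than exact gradients or mini-batches whose size must diverge.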
