Paper Title

Variance Reduction for Policy-Gradient Methods via Empirical Variance Minimization

Paper Authors

Maxim Kaledin, Alexander Golubev, Denis Belomestny

Paper Abstract

Policy-gradient methods in Reinforcement Learning (RL) are very general and widely applied in practice, but their performance suffers from the high variance of the gradient estimate. Several procedures have been proposed to reduce this variance, including the actor-critic (AC) and advantage actor-critic (A2C) methods. Recently, these approaches have gained a new perspective with the introduction of Deep RL: both new control variates (CV) and new sub-sampling procedures became available in the setting of complex models such as neural networks. A vital component of CV-based methods is the goal functional used to train the CV, the most popular being the least-squares criterion of A2C. Despite its practical success, this criterion is not the only one possible. In this paper we investigate, for the first time, the performance of an alternative criterion called Empirical Variance (EV). We observe in experiments that the EV criterion not only performs no worse than A2C but can sometimes be considerably better. In addition, we prove theoretical guarantees of actual variance reduction under very general assumptions and show that the A2C least-squares goal functional is an upper bound for the EV goal. Our experiments indicate that, in terms of variance reduction, EV-based methods significantly outperform A2C and allow for stronger variance reduction.
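The claimed upper bound follows directly from the variance decomposition. A minimal sketch, with notation assumed here rather than taken from the paper: let the residual be X = R − V_φ(s), where R is a sampled return, s a state, and V_φ the trained control variate. For any random variable X,

\[
\operatorname{Var}[X] = \mathbb{E}[X^2] - \bigl(\mathbb{E}[X]\bigr)^2 \le \mathbb{E}[X^2],
\]

so the A2C least-squares objective \(\mathbb{E}[(R - V_\phi(s))^2]\) dominates the EV objective \(\operatorname{Var}[R - V_\phi(s)]\). The empirical counterpart replaces expectations with sample means:

\[
\frac{1}{n}\sum_{i=1}^{n}\bigl(X_i - \bar{X}\bigr)^2 \le \frac{1}{n}\sum_{i=1}^{n} X_i^2,
\qquad \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i .
\]

Minimizing the EV objective therefore leaves the mean level of the residuals unconstrained, a degree of freedom the least-squares criterion gives up.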

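For illustration only, below is a toy sketch (not the authors' implementation) of the difference between the two criteria when fitting a state-dependent baseline: the EV loss is the empirical variance of the residuals, while the A2C loss is their second moment. The names (ev_loss, a2c_loss, baseline) and the simplification of applying both criteria to return residuals rather than to the full gradient estimate are assumptions.

import torch

def ev_loss(residuals: torch.Tensor) -> torch.Tensor:
    # Empirical-variance (EV) criterion: (1/n) * sum_i (x_i - mean(x))^2.
    # Unlike the A2C loss below, the mean of the residuals is left free.
    return residuals.var(unbiased=False)

def a2c_loss(residuals: torch.Tensor) -> torch.Tensor:
    # A2C least-squares criterion: (1/n) * sum_i x_i^2, the second moment.
    # Since var(x) = E[x^2] - (E[x])^2 <= E[x^2], it upper-bounds ev_loss.
    return (residuals ** 2).mean()

# Toy usage: fit a linear baseline to synthetic returns with the EV criterion.
torch.manual_seed(0)
states = torch.randn(256, 4)                           # synthetic states
returns = states.sum(dim=1) + 0.5 * torch.randn(256)   # synthetic returns
baseline = torch.nn.Linear(4, 1)
opt = torch.optim.Adam(baseline.parameters(), lr=1e-2)

for _ in range(200):
    residuals = returns - baseline(states).squeeze(-1)
    loss = ev_loss(residuals)
    opt.zero_grad()
    loss.backward()
    opt.step()

final = returns - baseline(states).squeeze(-1)
print(f"EV loss: {ev_loss(final).item():.4f}, A2C loss: {a2c_loss(final).item():.4f}")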