Paper Title

Deep Reinforcement Learning with Weighted Q-Learning

Authors

Andrea Cini, Carlo D'Eramo, Jan Peters, Cesare Alippi

Abstract

Reinforcement learning algorithms based on Q-Learning are driving Deep Reinforcement Learning (DRL) research towards solving complex problems and achieving super-human performance on many of them. Nevertheless, Q-Learning is known to be positively biased, since it learns by using the maximum over noisy estimates of expected values. Systematic overestimation of the action values, coupled with the inherently high variance of DRL methods, can lead to incrementally accumulating errors, causing learning algorithms to diverge. Ideally, we would like DRL agents to take into account their own uncertainty about the optimality of each action, and be able to exploit it to make more informed estimates of the expected return. In this regard, Weighted Q-Learning (WQL) effectively reduces bias and shows remarkable results in stochastic environments. WQL uses a weighted sum of the estimated action values, where the weights correspond to the probability of each action value being the maximum; however, the computation of these probabilities is only practical in the tabular setting. In this work, we provide methodological advances to benefit from the WQL properties in DRL, by using neural networks trained with Dropout as an effective approximation of deep Gaussian processes. In particular, we adopt the Concrete Dropout variant to obtain calibrated estimates of epistemic uncertainty in DRL. The estimator is then obtained by taking several stochastic forward passes through the action-value network and computing the weights in a Monte Carlo fashion. Such weights are Bayesian estimates of the probability that each action value is the maximum w.r.t. the posterior probability distribution estimated by Dropout. We show how our novel Deep Weighted Q-Learning algorithm reduces the bias w.r.t. relevant baselines and provide empirical evidence of its advantages on representative benchmarks.
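
To make the estimator concrete, below is a minimal PyTorch sketch (not the authors' implementation) of how the Monte Carlo weights and the resulting weighted target could be computed. The network architecture, the plain dropout layer standing in for Concrete Dropout, and the names DropoutQNetwork and weighted_q_target are illustrative assumptions.

# Minimal sketch of a Deep Weighted Q-Learning target, assuming a PyTorch
# Q-network whose dropout layers are kept active at inference time.
# Illustration of the idea described in the abstract, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DropoutQNetwork(nn.Module):
    """Small action-value network with dropout; layer sizes are illustrative."""

    def __init__(self, state_dim: int, n_actions: int, p: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Dropout(p),  # stand-in for the Concrete Dropout variant used in the paper
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def weighted_q_target(q_net: nn.Module,
                      next_state: torch.Tensor,   # (batch, state_dim)
                      reward: torch.Tensor,       # (batch,)
                      done: torch.Tensor,         # (batch,) floats in {0, 1}
                      gamma: float = 0.99,
                      n_samples: int = 30) -> torch.Tensor:
    """Monte Carlo estimate of the weighted Q-Learning target for a batch of transitions."""
    q_net.train()  # keep dropout stochastic during the forward passes
    with torch.no_grad():
        # Several stochastic forward passes: (n_samples, batch, n_actions).
        samples = torch.stack([q_net(next_state) for _ in range(n_samples)])

        # weights[b, a] = fraction of passes in which action a attained the largest
        # value, i.e. a Monte Carlo estimate of the probability that a is optimal
        # under the dropout-approximated posterior.
        argmax = samples.argmax(dim=-1)                  # (n_samples, batch)
        counts = F.one_hot(argmax, samples.shape[-1])    # (n_samples, batch, n_actions)
        weights = counts.float().mean(dim=0)             # (batch, n_actions)

        # Weighted sum of the mean action values instead of a hard max (Q-Learning)
        # or a single argmax evaluation (Double Q-Learning).
        mean_q = samples.mean(dim=0)                     # (batch, n_actions)
        weighted_value = (weights * mean_q).sum(dim=-1)  # (batch,)

        return reward + gamma * (1.0 - done) * weighted_value

Each of the n_samples passes draws a different dropout mask, so the frequency with which an action attains the maximum approximates the probability that it is optimal; the target then averages the action values with these probabilities rather than taking a hard maximum, which is what reduces the overestimation bias.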
