On the Reuse Bias in Off-Policy Reinforcement Learning

Paper title

On the Reuse Bias in Off-Policy Reinforcement Learning

Authors

Chengyang Ying, Zhongkai Hao, Xinning Zhou, Hang Su, Dong Yan, Jun Zhu

Abstract

Importance sampling (IS) is a popular technique in off-policy evaluation, which re-weights the return of trajectories in the replay buffer to boost sample efficiency. However, training with IS can be unstable and previous attempts to address this issue mainly focus on analyzing the variance of IS. In this paper, we reveal that the instability is also related to a new notion of Reuse Bias of IS -- the bias in off-policy evaluation caused by the reuse of the replay buffer for evaluation and optimization. We theoretically show that the off-policy evaluation and optimization of the current policy with the data from the replay buffer result in an overestimation of the objective, which may cause an erroneous gradient update and degenerate the performance. We further provide a high-probability upper bound of the Reuse Bias, and show that controlling one term of the upper bound can control the Reuse Bias by introducing the concept of stability for off-policy algorithms. Based on these analyses, we finally present a novel Bias-Regularized Importance Sampling (BIRIS) framework along with practical algorithms, which can alleviate the negative impact of the Reuse Bias. Experimental results show that our BIRIS-based methods can significantly improve the sample efficiency on a series of continuous control tasks in MuJoCo.
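The overestimation described in the abstract can be illustrated with a toy example. The sketch below is not the paper's experimental setup and does not implement BIRIS; it assumes a one-step bandit with a uniform behavior policy in which every arm has a true mean reward of 0, and all function names are hypothetical. A policy is chosen to maximize the IS estimate on a buffer and is then evaluated twice: once on that same buffer (reuse) and once on a freshly sampled buffer.

```python
import random


def is_estimate(buffer, target_probs, behavior_probs):
    """Importance-sampling estimate of the target policy's expected reward."""
    total = 0.0
    for action, reward in buffer:
        weight = target_probs[action] / behavior_probs[action]
        total += weight * reward
    return total / len(buffer)


def sample_buffer(rng, behavior_probs, n):
    """Draw n (action, reward) pairs; every arm has true mean reward 0."""
    arms = list(range(len(behavior_probs)))
    return [
        (rng.choices(arms, weights=behavior_probs)[0], rng.gauss(0.0, 1.0))
        for _ in range(n)
    ]


def best_one_hot_policy(buffer, behavior_probs):
    """Pick the deterministic policy with the highest IS estimate on `buffer`."""
    k = len(behavior_probs)
    candidates = [
        [1.0 if a == arm else 0.0 for a in range(k)] for arm in range(k)
    ]
    return max(candidates, key=lambda p: is_estimate(buffer, p, behavior_probs))


def run_experiment(trials=2000, n=50, k=5, seed=0):
    """Average the IS estimate of the optimized policy on reused vs. fresh data."""
    rng = random.Random(seed)
    behavior = [1.0 / k] * k
    reused_total, fresh_total = 0.0, 0.0
    for _ in range(trials):
        buffer = sample_buffer(rng, behavior, n)
        policy = best_one_hot_policy(buffer, behavior)
        # Same buffer used for both optimization and evaluation.
        reused_total += is_estimate(buffer, policy, behavior)
        # Independent buffer: the policy was not tuned on these samples.
        fresh_total += is_estimate(sample_buffer(rng, behavior, n), policy, behavior)
    return reused_total / trials, fresh_total / trials


if __name__ == "__main__":
    reused, fresh = run_experiment()
    print(f"reused-buffer estimate: {reused:.3f}")
    print(f"fresh-buffer estimate:  {fresh:.3f}")
```

Because every policy has a true value of 0 in this toy setting, any systematic gap between the two averages comes from selecting the policy on the same samples used to evaluate it. The fresh-buffer average should stay near 0, while the reused-buffer average should be clearly positive.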
