Paper Title
Offline Policy Optimization in RL with Variance Regularization
Paper Authors
Paper Abstract
Learning policies from fixed offline datasets is a key challenge in scaling up reinforcement learning (RL) algorithms towards practical applications. This is largely because off-policy RL algorithms suffer from distributional shift, due to the mismatch between the dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms using stationary distribution corrections. We show that by using Fenchel duality, we can avoid the double sampling issue when computing the gradient of the variance regularizer. The proposed offline variance regularization algorithm (OVAR) can be used to augment any existing offline policy optimization algorithm. We show that the regularizer leads to a lower bound on the offline policy optimization objective, which can help avoid over-estimation errors and explains the benefits of our approach across a range of continuous control domains when compared to existing state-of-the-art algorithms.
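As a rough illustration of the double-sampling point mentioned above: the variance of a random variable admits the variational (Fenchel-dual) form Var[X] = min_nu E[(X - nu)^2], whose minimizer is nu* = E[X], so for a fixed dual variable nu each sample gives an unbiased estimate of the inner expectation, whereas estimating (E[X])^2 directly would require two independent samples. The sketch below is only a minimal illustration of this identity under an assumed PyTorch setup; the names `dual_variance_estimate`, `nu`, and the toy data are illustrative and are not the authors' OVAR implementation.

    # Minimal sketch (not the paper's code): estimate Var[x] via the
    # dual form  Var[X] = min_nu E[(X - nu)^2],  avoiding double sampling.
    import torch

    def dual_variance_estimate(x, nu):
        """Single-sample-friendly surrogate for Var[x] at dual variable nu."""
        return ((x - nu) ** 2).mean()

    torch.manual_seed(0)
    x = torch.randn(1024) * 3.0 + 5.0        # toy samples, true variance ~9
    nu = torch.zeros(1, requires_grad=True)  # dual (Fenchel) variable
    opt = torch.optim.SGD([nu], lr=0.1)

    for _ in range(200):
        opt.zero_grad()
        loss = dual_variance_estimate(x, nu)  # minimized at nu = E[x]
        loss.backward()
        opt.step()

    with torch.no_grad():
        print(float(nu), float(dual_variance_estimate(x, nu)))  # ~5.0, ~9.0
        print(float(x.var(unbiased=False)))                     # reference variance

In an offline RL objective, a surrogate of this form could in principle be subtracted (scaled by a regularization coefficient) from the policy objective, yielding the kind of min-max formulation the abstract alludes to; the exact saddle-point objective used by OVAR is given in the paper itself.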