Paper Title
A Behavior Regularized Implicit Policy for Offline Reinforcement Learning
Paper Authors
Paper Abstract
Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment. The lack of environmental interactions makes the policy training vulnerable to state-action pairs far from the training dataset and prone to missing rewarding actions. For training more effective agents, we propose a framework that supports learning a flexible yet well-regularized fully-implicit policy. We further propose a simple modification to the classical policy-matching methods for regularizing with respect to the dual form of the Jensen-Shannon divergence and the integral probability metrics. We theoretically show the correctness of the policy-matching approach, and the correctness and a good finite-sample property of our modification. An effective instantiation of our framework through the GAN structure is provided, together with techniques to explicitly smooth the state-action mapping for robust generalization beyond the static dataset. Extensive experiments and ablation studies on the D4RL benchmark validate our framework and the effectiveness of our algorithmic designs.
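For reference, the "dual form of the Jensen-Shannon divergence" mentioned in the abstract is the standard variational (GAN-style) representation; the sketch below uses illustrative notation (learned policy $\pi$, behavior policy $\pi_b$, discriminator $D$) that is not taken from the paper itself:

$$
\mathrm{JS}\!\left(\pi_b(\cdot \mid s)\,\middle\|\,\pi(\cdot \mid s)\right)
= \log 2 + \tfrac{1}{2}\,\max_{D}\Big(
\mathbb{E}_{a \sim \pi_b(\cdot \mid s)}\big[\log D(s,a)\big]
+ \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\log\big(1 - D(s,a)\big)\big]
\Big).
$$

Under this form, minimizing the divergence with respect to $\pi$ while maximizing over the discriminator $D$ recovers a GAN-style minimax objective, which is consistent with the abstract's instantiation of the framework through the GAN structure.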