Paper Title
A Behavior Regularized Implicit Policy for Offline Reinforcement Learning
Paper Authors
Paper Abstract
Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment. The lack of environmental interactions makes the policy training vulnerable to state-action pairs far from the training dataset and prone to missing rewarding actions. For training more effective agents, we propose a framework that supports learning a flexible yet well-regularized fully-implicit policy. We further propose a simple modification to the classical policy-matching methods for regularizing with respect to the dual form of the Jensen-Shannon divergence and the integral probability metrics. We theoretically show the correctness of the policy-matching approach, and the correctness and a good finite-sample property of our modification. An effective instantiation of our framework through the GAN structure is provided, together with techniques to explicitly smooth the state-action mapping for robust generalization beyond the static dataset. Extensive experiments and ablation studies on the D4RL benchmark validate our framework and the effectiveness of our algorithmic designs.
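For reference, the "dual form of the Jensen-Shannon divergence" mentioned in the abstract is the standard variational (GAN-style) representation; the sketch below uses illustrative notation (learned policy $\pi$, behavior policy $\pi_b$, discriminator $D$) that is not taken from the paper itself:

$$
\mathrm{JS}\!\left(\pi_b(\cdot \mid s)\,\middle\|\,\pi(\cdot \mid s)\right)
= \log 2 + \tfrac{1}{2}\,\max_{D}\Big(
\mathbb{E}_{a \sim \pi_b(\cdot \mid s)}\big[\log D(s,a)\big]
+ \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\log\big(1 - D(s,a)\big)\big]
\Big).
$$

Under this form, minimizing the divergence with respect to $\pi$ while maximizing over the discriminator $D$ recovers a GAN-style minimax objective, which is consistent with the abstract's instantiation of the framework through the GAN structure.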