Paper Title

Model-Based Offline Meta-Reinforcement Learning with Regularization

Paper Authors

Sen Lin, Jialin Wan, Tengyu Xu, Yingbin Liang, Junshan Zhang

Paper Abstract

Existing offline reinforcement learning (RL) methods face a few major challenges, particularly the distributional shift between the learned policy and the behavior policy. Offline Meta-RL is emerging as a promising approach to address these challenges, aiming to learn an informative meta-policy from a collection of tasks. Nevertheless, as shown in our empirical studies, offline Meta-RL could be outperformed by offline single-task RL methods on tasks with good quality of datasets, indicating that a right balance has to be delicately calibrated between "exploring" the out-of-distribution state-actions by following the meta-policy and "exploiting" the offline dataset by staying close to the behavior policy. Motivated by such empirical analysis, we explore model-based offline Meta-RL with regularized Policy Optimization (MerPO), which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions. In particular, we devise a new meta-Regularized model-based Actor-Critic (RAC) method for within-task policy optimization, as a key building block of MerPO, using conservative policy evaluation and regularized policy improvement; and the intrinsic tradeoff therein is achieved via striking the right balance between two regularizers, one based on the behavior policy and the other on the meta-policy. We theoretically show that the learnt policy offers guaranteed improvement over both the behavior policy and the meta-policy, thus ensuring the performance improvement on new tasks via offline Meta-RL. Experiments corroborate the superior performance of MerPO over existing offline Meta-RL methods.
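
As an illustrative sketch only (the notation below is ours, not taken from the paper), the regularized policy improvement step described in the abstract can be pictured as maximizing a conservatively evaluated critic while interpolating between two policy regularizers with a weight \lambda \in [0, 1]:

\pi_{k+1} \in \arg\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}\big[\hat{Q}^{\pi_k}(s, a)\big] \;-\; \alpha \big[\, \lambda\, D(\pi, \pi_{\beta}) + (1 - \lambda)\, D(\pi, \pi_{\text{meta}}) \,\big],

where \pi_{\beta} denotes the behavior policy underlying the offline dataset \mathcal{D}, \pi_{\text{meta}} the meta-policy, \hat{Q}^{\pi_k} the conservative critic estimate, and D a policy divergence (e.g., KL). Under this reading, \lambda sets the "exploiting vs. exploring" balance the abstract refers to: a larger \lambda keeps the learned policy close to the behavior policy, while a smaller \lambda lets it follow the meta-policy into out-of-distribution state-actions.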
