Paper Title

Rethinking Goal-conditioned Supervised Learning and Its Connection to Offline RL

Authors

Rui Yang, Yiming Lu, Wenzhe Li, Hao Sun, Meng Fang, Yali Du, Xiu Li, Lei Han, Chongjie Zhang

Abstract

Solving goal-conditioned tasks with sparse rewards using self-supervised learning is promising because of its simplicity and stability over current reinforcement learning (RL) algorithms. A recent work, called Goal-Conditioned Supervised Learning (GCSL), provides a new learning framework by iteratively relabeling and imitating self-generated experiences. In this paper, we revisit the theoretical property of GCSL -- optimizing a lower bound of the goal-reaching objective -- and extend GCSL as a novel offline goal-conditioned RL algorithm. The proposed method is named Weighted GCSL (WGCSL), in which we introduce an advanced compound weight consisting of three parts: (1) a discounted weight for goal relabeling, (2) a goal-conditioned exponential advantage weight, and (3) a best-advantage weight. Theoretically, WGCSL is proved to optimize an equivalent lower bound of the goal-conditioned RL objective and to generate monotonically improved policies via an iterated scheme. The monotonic property holds for any behavior policy, and therefore WGCSL can be applied to both online and offline settings. To evaluate algorithms in the offline goal-conditioned RL setting, we provide a benchmark including a range of point and simulated robot domains. Experiments on the introduced benchmark demonstrate that WGCSL can consistently outperform GCSL and existing state-of-the-art offline methods in the fully offline goal-conditioned setting.
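The three-part compound weight described in the abstract can be sketched as follows. This is a minimal illustration only, assuming a hindsight-relabeled transition at trajectory step t with a goal taken from a later step i; the function name, hyperparameter values, and the thresholding rule used for the best-advantage term are assumptions for this sketch, not the paper's exact formulation:

```python
import numpy as np

def compound_weight(t, i, advantage, gamma=0.98, adv_clip=10.0,
                    adv_threshold=0.0, eps_min=0.05):
    """Sketch of a WGCSL-style compound weight for one relabeled transition.

    t          -- index of the transition in its trajectory
    i          -- index of the future state used as the relabeled goal (i >= t)
    advantage  -- estimated goal-conditioned advantage A(s_t, a_t, g=phi(s_i))
    """
    # (1) Discounted relabeling weight: goals relabeled from states
    #     further in the future contribute less.
    w_discount = gamma ** (i - t)
    # (2) Exponential advantage weight, clipped from above for
    #     numerical stability (assumed clipping scheme).
    w_advantage = np.exp(np.clip(advantage, None, adv_clip))
    # (3) Best-advantage weight: keep full weight only for transitions
    #     whose advantage exceeds a threshold, otherwise down-weight
    #     to a small constant (assumed thresholding rule).
    w_best = 1.0 if advantage > adv_threshold else eps_min
    return w_discount * w_advantage * w_best
```

The resulting weight multiplies the per-sample supervised (imitation) loss, so WGCSL reduces to plain GCSL when all three factors are 1.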
