Paper Title
Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels
Paper Authors
Paper Abstract
Controlling artificial agents from visual sensory data is an arduous task. Reinforcement learning (RL) algorithms can succeed but require large amounts of interactions between the agent and the environment. To alleviate the issue, unsupervised RL proposes to employ self-supervised interaction and learning, for adapting faster to future tasks. Yet, as shown in the Unsupervised RL Benchmark (URLB; Laskin et al. 2021), it is still unclear whether current unsupervised strategies can improve generalization capabilities, especially in visual control settings. In this work, we study the URLB and propose a new method to solve it, using unsupervised model-based RL for pre-training the agent, and a task-aware fine-tuning strategy combined with a newly proposed hybrid planner, Dyna-MPC, to adapt the agent to downstream tasks. On URLB, our method obtains 93.59% overall normalized performance, surpassing previous baselines by a staggering margin. We evaluate the approach through a large-scale empirical study, which we use to validate our design choices and analyze our models. We also show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation. Project website: https://masteringurlb.github.io/
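To make the hybrid-planner idea concrete: Dyna-MPC, as described above, combines MPC-style planning over a learned world model with a policy and value function learned in imagination. Below is a minimal sketch of how such a planner might be structured. It is not the authors' implementation; all names (dynamics_fn, policy_fn, value_fn, dyna_mpc_plan) and the toy linear dynamics are illustrative assumptions standing in for trained networks.

```python
# Sketch of a Dyna-MPC-style hybrid planner: CEM/MPPI-flavored MPC whose
# candidate action sequences mix sampled noise with proposals from a
# learned policy, and whose imagined returns are bootstrapped with a
# learned value function. All components below are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def dynamics_fn(state, action):
    """Hypothetical one-step learned dynamics: returns (next_state, reward)."""
    next_state = 0.9 * state + 0.1 * action
    reward = -float(np.sum(next_state ** 2))
    return next_state, reward

def policy_fn(state):
    """Hypothetical learned actor, used as an action proposal (Dyna-style)."""
    return np.clip(-state, -1.0, 1.0)

def value_fn(state):
    """Hypothetical learned value, used to bootstrap truncated rollouts."""
    return -float(np.sum(state ** 2))

def dyna_mpc_plan(state, horizon=5, n_samples=64, n_policy=16,
                  n_elites=8, n_iters=3, gamma=0.99, action_dim=2):
    """Plan one action by iteratively refitting a Gaussian over sequences."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current plan.
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        candidates = np.clip(mean + std * noise, -1.0, 1.0)
        # Replace some candidates with policy rollouts through the model.
        for i in range(n_policy):
            s = state
            for t in range(horizon):
                a = policy_fn(s)
                candidates[i, t] = a
                s, _ = dynamics_fn(s, a)
        # Score candidates with imagined rewards plus a value bootstrap.
        returns = np.zeros(n_samples)
        for i in range(n_samples):
            s, ret, disc = state, 0.0, 1.0
            for t in range(horizon):
                s, r = dynamics_fn(s, candidates[i, t])
                ret += disc * r
                disc *= gamma
            returns[i] = ret + disc * value_fn(s)
        # Refit the sampling distribution to the elite sequences.
        elites = candidates[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # receding horizon: execute only the first action

print(dyna_mpc_plan(np.array([0.5, -0.3])))
```

The receding-horizon loop (replan at every step, execute only the first action) is what makes this MPC, while the policy proposals and value bootstrap are the Dyna-style ingredients that let short imagined rollouts stand in for long ones.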