Paper Title
Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning
Paper Authors
Paper Abstract
Off-policy reinforcement learning algorithms promise to be applicable in settings where only a fixed dataset (batch) of environment interactions is available and no new experience can be acquired. This property makes these algorithms appealing for real-world problems such as robot control. In practice, however, standard off-policy algorithms fail in the batch setting for continuous control. In this paper, we propose a simple solution to this problem. It admits the use of data generated by arbitrary behavior policies and uses a learned prior -- the advantage-weighted behavior model (ABM) -- to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task. Our method can be seen as an extension of recent work on batch RL that enables stable learning from conflicting data sources. We find improvements over competitive baselines in a variety of RL tasks -- including standard continuous control benchmarks and multi-task learning for simulated and real-world robots.
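To make the mechanism described above concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of the two components the abstract refers to: a prior fitted to the batch data by advantage-weighted behavior modelling, and a policy improvement step that is regularized toward that prior. The network sizes, the indicator weighting on the advantage, the `critic_q` callable, and the fixed KL weight `alpha` are all illustrative assumptions.

```python
# Sketch of an ABM-style prior and a prior-regularized policy update.
# Assumptions: a learned Q-function `critic_q(obs, act) -> Q values` and
# per-sample advantage estimates A(s, a) computed from the batch.
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy over continuous actions."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * act_dim),
        )

    def dist(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())


def abm_prior_loss(prior, obs, act, advantage):
    # Advantage-weighted behavior modelling: imitate only those batch actions
    # whose estimated advantage is non-negative (one choice of weighting,
    # f(A) = 1[A >= 0]).
    weights = (advantage >= 0).float()
    log_prob = prior.dist(obs).log_prob(act).sum(-1)
    return -(weights * log_prob).mean()


def policy_loss(policy, prior, critic_q, obs, alpha=0.1, n_samples=10):
    # Improve the policy against the learned Q-function while staying close
    # (in KL) to the ABM prior, which keeps actions within the data support.
    dist = policy.dist(obs)
    actions = dist.rsample((n_samples,))            # [n_samples, B, act_dim]
    q = critic_q(obs.expand(n_samples, *obs.shape), actions)  # [n_samples, B]
    with torch.no_grad():
        prior_dist = prior.dist(obs)
    kl = torch.distributions.kl_divergence(dist, prior_dist).sum(-1)
    return (-q.mean(0) + alpha * kl).mean()
```

Here `advantage` stands for an estimate of A(s, a) = Q(s, a) - V(s) computed on the batch, and the fixed penalty weight `alpha` is a simplification: in the paper the policy improvement step is a constrained, MPO-style optimization against the learned prior rather than a fixed-coefficient penalty.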