Paper Title

Ready Policy One: World Building Through Active Learning

Paper Authors

Ball, Philip, Parker-Holder, Jack, Pacchiano, Aldo, Choromanski, Krzysztof, Roberts, Stephen

Paper Abstract

Model-Based Reinforcement Learning (MBRL) offers a promising direction for sample-efficient learning, often achieving state-of-the-art results for continuous control tasks. However, many existing MBRL methods rely on combining greedy policies with exploration heuristics, and even those which utilize principled exploration bonuses construct dual objectives in an ad hoc fashion. In this paper we introduce Ready Policy One (RP1), a framework that views MBRL as an active learning problem, where we aim to improve the world model in the fewest samples possible. RP1 achieves this by utilizing a hybrid objective function, which crucially adapts during optimization, allowing the algorithm to trade off reward vs. exploration at different stages of learning. In addition, we introduce a principled mechanism to terminate sample collection once we have a rich enough trajectory batch to improve the model. We rigorously evaluate our method on a variety of continuous control tasks, and demonstrate statistically significant gains over existing approaches.
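To make the "hybrid objective" idea concrete, here is a minimal sketch of how a reward/exploration trade-off of this kind can be expressed. It assumes the exploration signal is disagreement across an ensemble world model and that a single weight `beta` is adapted over the course of learning; the function name `hybrid_reward` and its arguments are illustrative, not the paper's actual interface.

```python
import numpy as np

def hybrid_reward(task_reward, ensemble_predictions, beta):
    """Sketch of a hybrid objective: blend task reward with an exploration bonus.

    task_reward          -- scalar reward predicted for the current transition
    ensemble_predictions -- array of shape (n_members, state_dim), one next-state
                            prediction per world-model ensemble member
    beta                 -- weight in [0, 1]; larger values favour exploration
    """
    # Exploration bonus: disagreement (std. dev.) across ensemble predictions,
    # averaged over state dimensions, as a proxy for model uncertainty.
    exploration_bonus = np.std(ensemble_predictions, axis=0).mean()
    # Weighted trade-off between exploiting reward and reducing model uncertainty.
    return (1.0 - beta) * task_reward + beta * exploration_bonus
```

In a sketch like this, `beta` would be adjusted between iterations (e.g., annealed or adapted from recent model improvement), so that early data collection emphasizes informative trajectories while later policies focus on reward.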
