Paper Title


Augmenting Policy Learning with Routines Discovered from a Single Demonstration

Paper Authors

Zelin Zhao, Chuang Gan, Jiajun Wu, Xiaoxiao Guo, Joshua B. Tenenbaum

Paper Abstract


Humans can abstract prior knowledge from very little data and use it to boost skill learning. In this paper, we propose routine-augmented policy learning (RAPL), which discovers routines composed of primitive actions from a single demonstration and uses discovered routines to augment policy learning. To discover routines from the demonstration, we first abstract routine candidates by identifying grammar over the demonstrated action trajectory. Then, the best routines measured by length and frequency are selected to form a routine library. We propose to learn policy simultaneously at primitive-level and routine-level with discovered routines, leveraging the temporal structure of routines. Our approach enables imitating expert behavior at multiple temporal scales for imitation learning and promotes reinforcement learning exploration. Extensive experiments on Atari games demonstrate that RAPL improves the state-of-the-art imitation learning method SQIL and reinforcement learning method A2C. Further, we show that discovered routines can generalize to unseen levels and difficulties on the CoinRun benchmark.
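
The abstract describes extracting routine candidates by inducing a grammar over the demonstrated action trajectory and keeping the best candidates scored by length and frequency. As a rough illustration of that selection step only, the sketch below enumerates repeated action sub-sequences directly (rather than running the paper's grammar induction) and ranks them by length times frequency; the function name, parameters, and scoring rule are assumptions made for this example, not the authors' implementation.

```python
from collections import Counter
from typing import List, Tuple

def discover_routines(actions: List[int],
                      max_len: int = 6,
                      top_k: int = 4) -> List[Tuple[int, ...]]:
    """Illustrative routine discovery from one demonstrated action trajectory.

    Enumerates repeated sub-sequences (lengths 2..max_len) and keeps the
    top_k candidates scored by length * frequency, as a simplified stand-in
    for grammar-based candidate extraction.
    """
    counts: Counter = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(actions) - n + 1):
            counts[tuple(actions[i:i + n])] += 1

    # Keep only sub-sequences that actually repeat, then score them.
    candidates = {seq: c for seq, c in counts.items() if c >= 2}
    scored = sorted(candidates.items(),
                    key=lambda kv: len(kv[0]) * kv[1],
                    reverse=True)
    return [seq for seq, _ in scored[:top_k]]

if __name__ == "__main__":
    # Toy trajectory of primitive action ids from a single demonstration.
    demo = [0, 1, 1, 2, 0, 1, 1, 2, 3, 0, 1, 1, 2]
    library = discover_routines(demo)
    print(library)  # the selected sub-sequences would form the routine library
```

In the paper's setting, each selected routine would then be exposed to the agent as a macro-action alongside the primitive actions, so the policy can act at both the primitive level and the routine level.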
