A Ranking Game for Imitation Learning

Paper title

A Ranking Game for Imitation Learning

Authors

Harshit Sikchi, Akanksha Saran, Wonjoon Goo, Scott Niekum

Abstract

We propose a new framework for imitation learning: treating imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to satisfy pairwise performance rankings between behaviors, while the policy agent learns to maximize this reward. In imitation learning, near-optimal expert data can be difficult to obtain, and even in the limit of infinite data it cannot imply a total ordering over trajectories as preferences can. On the other hand, learning from preferences alone is challenging, as a large number of preferences are required to infer a high-dimensional reward function, though preference data is typically much easier to collect than expert demonstrations. The classical inverse reinforcement learning (IRL) formulation learns from expert demonstrations but provides no mechanism to incorporate learning from offline preferences, and vice versa. We instantiate the proposed ranking-game framework with a novel ranking loss, giving an algorithm that can simultaneously learn from expert demonstrations and preferences, gaining the advantages of both modalities. Our experiments show that the proposed method achieves state-of-the-art sample efficiency and can solve previously unsolvable tasks in the Learning from Observation (LfO) setting. Project video and code can be found at https://hari-sikchi.github.io/rank-game/
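To make the reward agent's objective concrete, the following is a minimal sketch of a pairwise ranking loss. It is not the paper's novel ranking loss, which the abstract does not specify; it swaps in a generic Bradley-Terry style logistic loss purely for illustration. The names `trajectory_return`, `pairwise_ranking_loss`, and `linear_reward`, the scalar states, and the choice to rank the expert trajectory above the policy trajectory are all assumptions made for this example.

```python
import math


def trajectory_return(reward_fn, trajectory):
    """Sum a (hypothetical) learned reward over the states of one trajectory."""
    return sum(reward_fn(state) for state in trajectory)


def pairwise_ranking_loss(reward_fn, worse, better):
    """Generic Bradley-Terry style loss, not the paper's loss.

    The value is small when `better` receives a higher return than `worse`
    under `reward_fn`, and large when the ranking is violated.
    """
    margin = trajectory_return(reward_fn, better) - trajectory_return(reward_fn, worse)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def linear_reward(weight):
    """Toy reward model over scalar states, assumed only for illustration."""
    return lambda state: weight * state


# Assumed ranking for this sketch: the policy's behavior is ranked below the expert's.
policy_traj = [0.1, 0.2, 0.1]
expert_traj = [0.9, 1.0, 0.8]

loss_consistent = pairwise_ranking_loss(linear_reward(1.0), policy_traj, expert_traj)
loss_violating = pairwise_ranking_loss(linear_reward(-1.0), policy_traj, expert_traj)
loss_indifferent = pairwise_ranking_loss(linear_reward(0.0), policy_traj, expert_traj)
```

In this sketch, a reward that agrees with the ranking incurs a lower loss than one that reverses it. In the two-player game described above, the reward agent would minimize such a loss over a set of ranked pairs while the policy agent maximizes the resulting reward.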
