Paper Title

Inverse Reinforcement Learning via Matching of Optimality Profiles

Authors

Luis Haug, Ivan Ovinnikov, Eugene Bykovets

Abstract

The goal of inverse reinforcement learning (IRL) is to infer a reward function that explains the behavior of an agent performing a task. The assumption that most approaches make is that the demonstrated behavior is near-optimal. In many real-world scenarios, however, examples of truly optimal behavior are scarce, and it is desirable to effectively leverage sets of demonstrations of suboptimal or heterogeneous performance, which are easier to obtain. We propose an algorithm that learns a reward function from such demonstrations together with a weak supervision signal in the form of a distribution over rewards collected during the demonstrations (or, more generally, a distribution over cumulative discounted future rewards). We view such distributions, which we also refer to as optimality profiles, as summaries of the degree of optimality of the demonstrations that may, for example, reflect the opinion of a human expert. Given an optimality profile and a small amount of additional supervision, our algorithm fits a reward function, modeled as a neural network, by essentially minimizing the Wasserstein distance between the corresponding induced distribution and the optimality profile. We show that our method is capable of learning reward functions such that policies trained to optimize them outperform the demonstrations used for fitting the reward functions.
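The sketch below is a minimal illustration of the central idea described in the abstract, not the authors' implementation: a neural reward function is fitted so that the distribution of discounted returns it induces on a set of demonstration trajectories matches a given optimality profile, using the fact that the Wasserstein-1 distance between two one-dimensional empirical distributions of equal size reduces to comparing sorted samples. The network architecture, hyperparameters, and the synthetic `demos` and `target_profile` data are placeholders, and the paper's additional weak supervision step is omitted.

```python
# Minimal sketch (assumptions noted above): fit a neural reward so that the
# distribution of discounted returns it assigns to demonstration trajectories
# matches a target "optimality profile" under the Wasserstein-1 distance.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """State-dependent reward model; architecture is a placeholder."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states):                 # states: (T, obs_dim)
        return self.net(states).squeeze(-1)    # per-step rewards: (T,)

def discounted_return(rewards, gamma=0.99):
    # Cumulative discounted return of one trajectory from t = 0.
    discounts = gamma ** torch.arange(len(rewards), dtype=rewards.dtype)
    return (discounts * rewards).sum()

def wasserstein_1d(a, b):
    # For equal-sized 1-D samples, W1 equals the mean absolute difference
    # of the sorted values (empirical quantile matching).
    return (torch.sort(a).values - torch.sort(b).values).abs().mean()

# Hypothetical data: demonstration state sequences and a target profile with
# one return value per demonstration (e.g. expert-assigned scores).
obs_dim = 8
demos = [torch.randn(50, obs_dim) for _ in range(32)]
target_profile = torch.rand(32) * 10.0

reward_net = RewardNet(obs_dim)
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

for step in range(200):
    returns = torch.stack(
        [discounted_return(reward_net(traj)) for traj in demos])
    loss = wasserstein_1d(returns, target_profile)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this reading, the learned reward is shaped only through the distribution of trajectory-level returns, which is what allows suboptimal and heterogeneous demonstrations to be used: the profile encodes how good each demonstration is, without requiring any of them to be optimal.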
