Paper Title
Learning from Suboptimal Demonstration via Self-Supervised Reward Regression
Paper Authors
Paper Abstract
Learning from Demonstration (LfD) seeks to democratize robotics by enabling non-roboticist end-users to teach robots to perform a task by providing a human demonstration. However, modern LfD techniques, e.g., inverse reinforcement learning (IRL), assume users provide at least stochastically optimal demonstrations. This assumption fails to hold in most real-world scenarios. Recent attempts to learn from suboptimal demonstration leverage pairwise rankings and follow the Luce-Shepard rule. However, we show these approaches make incorrect assumptions and thus suffer from brittle, degraded performance. We overcome these limitations in developing a novel approach that bootstraps off suboptimal demonstrations to synthesize optimality-parameterized data to train an idealized reward function. We empirically validate that we learn an idealized reward function with ~0.95 correlation with ground-truth reward, versus ~0.75 for prior work. We can then train policies achieving ~200% improvement over the suboptimal demonstration and ~90% improvement over prior work. We present a physical demonstration of teaching a robot a topspin strike in table tennis that achieves 32% faster returns and 40% more topspin than the user demonstration.
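For intuition, the sketch below (not the authors' implementation) illustrates the core self-supervised idea described in the abstract: perturb a policy cloned from suboptimal demonstrations with increasing amounts of noise, assume performance degrades monotonically with noise, and regress a reward model on the synthetically labeled rollouts. The toy dynamics, the cloned policy, the trajectory features, and the simple "1 - noise" synthetic score are all illustrative assumptions, not details from the paper.

```python
# Minimal sketch of noise-injection-based self-supervised reward regression.
# Everything below (environment, cloned policy, features, score) is hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def cloned_policy(state):
    """Hypothetical policy cloned from suboptimal demonstrations."""
    return -0.5 * state  # imperfect proportional controller

def rollout(noise_level, horizon=50):
    """Roll out the cloned policy with action noise; return averaged trajectory features."""
    state, features = 1.0, []
    for _ in range(horizon):
        action = cloned_policy(state) + noise_level * rng.normal()
        state = state + 0.1 * action
        features.append([state, state ** 2])  # hypothetical trajectory features
    return np.mean(features, axis=0)

# Synthesize optimality-parameterized data: lower noise -> higher synthetic score.
noise_levels = np.linspace(0.0, 1.0, 20)
X = np.array([rollout(n) for n in noise_levels for _ in range(10)])
y = np.repeat(1.0 - noise_levels, 10)  # self-supervised performance target

# Regress a linear reward model on the synthesized (features, score) pairs.
w = np.zeros(X.shape[1])
for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y)
    w -= 0.1 * grad

print("learned reward weights:", w)
print("predicted score, low vs. high noise:", rollout(0.05) @ w, rollout(0.9) @ w)
```

The linear regressor and the "1 - noise" target are deliberate simplifications to keep the example self-contained; the key design choice being illustrated is that the noise schedule itself, rather than human rankings, supplies the supervision for the reward model.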