Paper Title

Fast Lifelong Adaptive Inverse Reinforcement Learning from Demonstrations

Paper Authors

Letian Chen, Sravan Jayanthi, Rohan Paleja, Daniel Martin, Viacheslav Zakharov, Matthew Gombolay

Paper Abstract

Learning from Demonstration (LfD) approaches empower end-users to teach robots novel tasks via demonstrations of the desired behaviors, democratizing access to robotics. However, current LfD frameworks are not capable of fast adaptation to heterogeneous human demonstrations nor of large-scale deployment in ubiquitous robotics applications. In this paper, we propose a novel LfD framework, Fast Lifelong Adaptive Inverse Reinforcement Learning (FLAIR). Our approach (1) leverages learned strategies to construct policy mixtures for fast adaptation to new demonstrations, allowing for quick end-user personalization; (2) distills common knowledge across demonstrations, achieving accurate task inference; and (3) expands its model only when needed in lifelong deployments, maintaining a concise set of prototypical strategies that can approximate all behaviors via policy mixtures. We empirically validate that FLAIR achieves adaptability (i.e., the robot adapts to heterogeneous, user-specific task preferences), efficiency (i.e., the robot achieves sample-efficient adaptation), and scalability (i.e., the model grows sublinearly with the number of demonstrations while maintaining high performance). FLAIR surpasses benchmarks across three control tasks with an average 57% improvement in policy returns and an average 78% fewer episodes required for demonstration modeling using policy mixtures. Finally, we demonstrate the success of FLAIR in a table tennis task and find users rate FLAIR as having higher task (p < .05) and personalization (p < .05) performance.
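To make the policy-mixture idea in contribution (1) concrete, here is a minimal, hypothetical sketch: a new demonstration is approximated as a convex combination of a small set of prototype policies by fitting mixture weights on the probability simplex. The linear prototype policies, synthetic demonstration, and gradient-descent fitting below are illustrative assumptions, not FLAIR's actual algorithm.

```python
import numpy as np

# Hypothetical sketch (not FLAIR's implementation): approximate a new
# demonstration's actions as a convex mixture of K prototype policies.

rng = np.random.default_rng(0)
K, T, obs_dim, act_dim = 3, 50, 4, 2

# Assumed prototype policies: fixed linear maps from observation to action.
prototypes = [rng.standard_normal((obs_dim, act_dim)) for _ in range(K)]

# Synthetic demonstration generated by a hidden mixture w_true.
w_true = np.array([0.6, 0.3, 0.1])
obs = rng.standard_normal((T, obs_dim))
demo_actions = sum(w * (obs @ P) for w, P in zip(w_true, prototypes))

# Fit mixture weights on the simplex via a softmax parameterization,
# minimizing the mean squared action-matching error by gradient descent.
theta = np.zeros(K)
lr = 0.2
for _ in range(5000):
    z = np.exp(theta - theta.max())
    w = z / z.sum()                                   # softmax -> simplex
    pred = sum(wi * (obs @ P) for wi, P in zip(w, prototypes))
    err = pred - demo_actions                         # (T, act_dim)
    # Gradient of 0.5 * mean(err^2) with respect to each mixture weight.
    g_w = np.array([(err * (obs @ P)).mean() for P in prototypes])
    # Chain rule through the softmax.
    g_theta = w * (g_w - (w * g_w).sum())
    theta -= lr * g_theta

print(np.round(w, 2))  # recovered mixture weights, close to w_true
```

Because only the K mixture weights are optimized (rather than a full policy), adaptation to a new demonstration is cheap, which is the intuition behind the sample-efficiency claim in the abstract.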
