Paper Title
Priors, Hierarchy, and Information Asymmetry for Skill Transfer in Reinforcement Learning
Paper Authors
Paper Abstract
The ability to discover behaviours from past experience and transfer them to new tasks is a hallmark of intelligent agents acting sample-efficiently in the real world. Equipping embodied reinforcement learners with the same ability may be crucial for their successful deployment in robotics. While hierarchical and KL-regularized reinforcement learning individually hold promise here, arguably a hybrid approach could combine their respective benefits. Key to these fields is the use of information asymmetry across architectural modules to bias which skills are learnt. While the choice of asymmetry has a large influence on transferability, existing methods base their choice primarily on intuition in a domain-independent, potentially sub-optimal manner. In this paper, we theoretically and empirically show the crucial expressivity-transferability trade-off of skills across sequential tasks, controlled by information asymmetry. Given this insight, we introduce Attentive Priors for Expressive and Transferable Skills (APES), a hierarchical KL-regularized method that benefits heavily from both priors and hierarchy. Unlike existing approaches, APES automates the choice of asymmetry by learning it in a data-driven, domain-dependent way based on our expressivity-transferability theorems. Experiments over complex transfer domains of varying levels of extrapolation and sparsity, such as robot block stacking, demonstrate the criticality of the correct asymmetry choice, with APES drastically outperforming previous methods.