Paper Title
Fighting Failures with FIRE: Failure Identification to Reduce Expert Burden in Intervention-Based Learning
Paper Authors
Paper Abstract
Supervised imitation learning, also known as behavioral cloning, suffers from distribution drift leading to failures during policy execution. One approach to mitigate this issue is to allow an expert to correct the agent's actions during task execution, based on the expert's determination that the agent has reached a 'point of no return'. The agent's policy is then retrained using this new corrective data. This approach alone can enable high-performance agents to be learned, but at a substantial cost: the expert must vigilantly observe execution until the policy reaches a specified level of success, and even at that point, there is no guarantee that the policy will always succeed. To address these limitations, we present FIRE (Failure Identification to Reduce Expert Burden in intervention-based learning), a system that can predict when a running policy will fail, halt its execution, and request a correction from the expert. Unlike existing approaches that learn only from expert data, our approach learns from both expert and non-expert data, akin to adversarial learning. We demonstrate experimentally for a series of challenging manipulation tasks that our method is able to recognize state-action pairs that lead to failures. This permits seamless integration into an intervention-based learning system, where we show an order-of-magnitude gain in sample efficiency compared with a state-of-the-art inverse reinforcement learning method and dramatically improved performance over an equivalent amount of data learned with behavioral cloning.
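The control flow the abstract describes (run the policy, flag state-action pairs predicted to fail, halt, and request an expert correction) can be sketched as follows. This is a minimal toy illustration, not the paper's actual method: `FailurePredictor`, its distance-based scoring rule, the threshold, and the one-dimensional "environment" are all hypothetical stand-ins. FIRE itself trains a learned classifier on expert and non-expert data, akin to adversarial learning, which the hand-written score below only crudely mimics.

```python
class FailurePredictor:
    """Hypothetical stand-in for FIRE's learned failure classifier.

    Scores a state-action pair by its distance to known expert data,
    discounted by its distance to known non-expert (failure) data, so
    that pairs resembling past failures score high (likely to fail).
    """

    def __init__(self, threshold=0.2):
        self.threshold = threshold
        self.expert_data = []      # (state, action) pairs from the expert
        self.non_expert_data = []  # pairs collected from failed rollouts

    def score(self, state, action):
        def nearest(data):
            # L1 distance to the closest stored pair (1.0 if no data yet).
            if not data:
                return 1.0
            return min(abs(state - s) + abs(action - a) for s, a in data)
        return nearest(self.expert_data) - 0.5 * nearest(self.non_expert_data)

    def predicts_failure(self, state, action):
        return self.score(state, action) > self.threshold


def run_with_interventions(policy, expert_policy, predictor, steps=20):
    """Execute the policy; on a predicted failure, halt and substitute
    an expert correction, storing it for later retraining."""
    state = 0.0
    corrections = 0
    for _ in range(steps):
        action = policy(state)
        if predictor.predicts_failure(state, action):
            action = expert_policy(state)               # request correction
            predictor.expert_data.append((state, action))  # keep for retraining
            corrections += 1
        state += action  # toy 1-D dynamics
    return state, corrections


# --- toy demonstration (all numbers illustrative) ---
predictor = FailurePredictor(threshold=0.2)
predictor.expert_data = [(0.0, 0.1), (0.5, 0.1), (1.0, 0.1)]
predictor.non_expert_data = [(0.0, 0.5)]


def drifting_policy(s):
    return 0.5  # acts unlike the expert, resembling past failures


def expert_policy(s):
    return 0.1


final_state, n_corrections = run_with_interventions(
    drifting_policy, expert_policy, predictor, steps=5)
```

In this toy run every action of the drifting policy resembles the stored failure data, so all five steps are flagged and corrected; a policy matching the expert would trigger no interventions, which is the burden reduction the abstract claims.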