Paper Title
Causal Imitation Learning with Unobserved Confounders
Paper Authors
Abstract
One of the common ways children learn is by mimicking adults. Imitation learning focuses on learning policies with suitable performance from demonstrations generated by an expert, with an unspecified performance measure and an unobserved reward signal. Popular methods for imitation learning start either by directly mimicking the behavior policy of an expert (behavior cloning) or by learning a reward function that prioritizes the observed expert trajectories (inverse reinforcement learning). However, these methods rely on the assumption that the covariates the expert uses to determine her/his actions are fully observed. In this paper, we relax this assumption and study imitation learning when the sensory inputs of the learner and the expert differ. First, we provide a non-parametric graphical criterion that is complete (both necessary and sufficient) for determining the feasibility of imitation from a combination of demonstration data and qualitative assumptions about the underlying environment, represented in the form of a causal model. We then show that when this criterion does not hold, imitation may still be feasible by exploiting quantitative knowledge of the expert trajectories. Finally, we develop an efficient procedure for learning the imitating policy from experts' trajectories.
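The failure mode that motivates relaxing the "fully observed covariates" assumption can be illustrated with a small hypothetical model. This is not taken from the paper and is not the paper's criterion or procedure. Assume a binary confounder `U` that the expert sees but the learner does not. The expert sets action `X := U`, and the outcome is `Y := 1` exactly when `X == U`. Behavior cloning here means copying the marginal action distribution `P(X)` recorded in the demonstrations. The sketch enumerates the model exactly and compares the expert's `P(Y=1)` with that of a learner whose action cannot depend on `U`. All names (`P_U`, `outcome`, `cloned_reward`, ...) are illustrative assumptions.

```python
from itertools import product

# Hypothetical toy SCM (illustration only, not from the paper):
#   U ~ Bernoulli(0.5)  -- covariate observed by the expert, hidden from the learner
#   expert action X := U
#   outcome       Y := 1 if X == U else 0
P_U = {0: 0.5, 1: 0.5}


def outcome(x, u):
    """Structural equation for Y."""
    return 1 if x == u else 0


def expert_reward():
    """P(Y=1) under the expert's policy, which reads U directly (X = U)."""
    return sum(p * outcome(u, u) for u, p in P_U.items())


def expert_action_marginal():
    """P(X) as seen in the demonstration data; U itself is never recorded."""
    marginal = {0: 0.0, 1: 0.0}
    for u, p in P_U.items():
        marginal[u] += p  # the expert's action equals U
    return marginal


def cloned_reward(policy):
    """P(Y=1) when the learner draws X from `policy` independently of U."""
    return sum(
        p_u * p_x * outcome(x, u)
        for (u, p_u), (x, p_x) in product(P_U.items(), policy.items())
    )


bc_policy = expert_action_marginal()       # behavior cloning: copy P(X)
print(expert_reward())                     # 1.0
print(cloned_reward(bc_policy))            # 0.5
print(cloned_reward({0: 1.0, 1: 0.0}))     # 0.5, "always X=0" does no better
```

In this toy model every policy that ignores `U` yields `P(Y=1) = 0.5 * P(X=0) + 0.5 * P(X=1) = 0.5`. So no learner without access to `U` can reproduce the expert's outcome distribution, whether it clones `P(X)` or not. Deciding from the causal diagram and the observed data when such a gap is unavoidable is the question the abstract's graphical criterion addresses.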