Paper Title
State-only Imitation with Transition Dynamics Mismatch
Paper Authors
Paper Abstract
Imitation Learning (IL) is a popular paradigm for training agents to achieve complicated goals by leveraging expert behavior, rather than dealing with the hardships of designing a correct reward function. With the environment modeled as a Markov Decision Process (MDP), most of the existing IL algorithms are contingent on the availability of expert demonstrations in the same MDP as the one in which a new imitator policy is to be learned. This is uncharacteristic of many real-life scenarios where discrepancies between the expert and the imitator MDPs are common, especially in the transition dynamics function. Furthermore, obtaining expert actions may be costly or infeasible, making the recent trend towards state-only IL (where expert demonstrations consist only of states or observations) all the more promising. Building on recent adversarial imitation approaches that are motivated by the idea of divergence minimization, we present a new state-only IL algorithm. It divides the overall optimization objective into two subproblems by introducing an indirection step and solves the subproblems iteratively. We show that our algorithm is particularly effective when there is a transition dynamics mismatch between the expert and imitator MDPs, while the baseline IL methods suffer from performance degradation. To analyze this, we construct several interesting MDPs by modifying the configuration parameters of the MuJoCo locomotion tasks from OpenAI Gym.
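The abstract gives only the high-level structure of the algorithm: an indirection step splits the objective into two subproblems that are solved in alternation. The paper's actual components (adversarial discriminators, neural policies) are not described here, so the following is purely a runnable 1-D caricature of that alternating loop, in which the intermediate ("indirection") distribution is a Gaussian mean and the dynamics mismatch is a fixed bias; every quantity and name below is invented for illustration and is not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_state_samples(n=256):
    # States visited by the expert in the expert MDP (mean 1.0, chosen arbitrarily).
    return rng.normal(loc=1.0, scale=0.1, size=n)

def imitator_rollout(theta, n=256):
    # Imitator states under *mismatched* dynamics: a constant bias of -0.3
    # stands in for a different transition function.
    return rng.normal(loc=theta - 0.3, scale=0.1, size=n)

theta = 0.0   # imitator policy parameter
mu = 0.0      # parameter of the intermediate ("indirection") distribution

for step in range(200):
    # Subproblem 1: move the intermediate distribution toward the expert's
    # state distribution (small step: the indirection evolves gradually).
    mu += 0.1 * (expert_state_samples().mean() - mu)

    # Subproblem 2: update the imitator so that its own state distribution,
    # generated under its own dynamics, matches the intermediate distribution.
    states = imitator_rollout(theta)
    theta -= 0.5 * (states.mean() - mu)  # gradient step on squared mean gap

print(f"intermediate mean ~ {mu:.2f}, "
      f"imitator state mean ~ {imitator_rollout(theta).mean():.2f}")
```

Because only state distributions are compared, the toy imitator can compensate for its biased dynamics and still match the target states, which mirrors the transition-mismatch setting the abstract targets.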
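The abstract does not specify which configuration parameters were modified to build the mismatched MDPs. As a minimal sketch, assuming the mujoco-py-based Gym environments (where the loaded MuJoCo model's physical parameters are exposed as writable arrays), scaling gravity or body masses is one way to perturb the transition dynamics; the environment id and scale values here are illustrative only.

```python
import gym

def make_mismatched_env(env_id="Walker2d-v2", gravity_scale=1.5, mass_scale=1.0):
    """Build a Gym MuJoCo env whose transition dynamics differ from the
    default by scaling gravity and body masses in the loaded model."""
    env = gym.make(env_id)
    model = env.unwrapped.model
    model.opt.gravity[:] = model.opt.gravity * gravity_scale  # default (0, 0, -9.81)
    model.body_mass[:] = model.body_mass * mass_scale         # one entry per body
    return env

# Expert demonstrations come from the unmodified MDP...
expert_env = gym.make("Walker2d-v2")
# ...while the imitator must learn under perturbed dynamics.
imitator_env = make_mismatched_env("Walker2d-v2", gravity_scale=1.5)
```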