Paper Title
Proximal Point Imitation Learning
Paper Authors
Paper Abstract
This work develops new algorithms with rigorous efficiency guarantees for infinite-horizon imitation learning (IL) with linear function approximation, without restrictive coherence assumptions. We begin with the minimax formulation of the problem and then outline how to leverage classical tools from optimization, in particular the proximal-point method (PPM) and dual smoothing, for online and offline IL, respectively. Thanks to PPM, we avoid the nested policy evaluation and cost updates for online IL that appear in the prior literature. In particular, we do away with conventional alternating updates by optimizing a single convex and smooth objective over both cost and Q-functions. When this objective is solved inexactly, we relate the optimization errors to the suboptimality of the recovered policy. As an added bonus, by reinterpreting PPM as dual smoothing centered at the expert policy, we also obtain an offline IL algorithm with theoretical guarantees on the number of required expert trajectories. Finally, we achieve convincing empirical performance for both linear and neural-network function approximation.
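To make the two optimization tools named in the abstract concrete, here is a minimal sketch of the generic proximal-point method: each iterate solves a regularized subproblem x_{k+1} = argmin_x f(x) + (1/(2η))‖x − x_k‖². The objective, step size, and problem size below are illustrative placeholders, not the paper's actual IL objective over costs and Q-functions, and each subproblem is solved numerically (possibly inexactly), echoing the abstract's point about inexact solves.

```python
# Generic proximal-point method (PPM) sketch; toy stand-in, not the paper's algorithm.
import numpy as np
from scipy.optimize import minimize

def proximal_point(f, x0, eta=1.0, iters=50):
    """Run PPM on f; each regularized subproblem is solved by L-BFGS."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        # The prox term makes the subproblem strongly convex, so it stays
        # well-conditioned even when f itself is only convex.
        sub = lambda z, c=x: f(z) + np.sum((z - c) ** 2) / (2.0 * eta)
        x = minimize(sub, x, method="L-BFGS-B").x
    return x

rng = np.random.default_rng(0)
A, b = rng.normal(size=(8, 3)), rng.normal(size=8)
least_squares = lambda z: 0.5 * np.sum((A @ z - b) ** 2)  # toy convex objective
print(proximal_point(least_squares, x0=np.zeros(3)))
```

The "dual smoothing with a center point" idea can likewise be illustrated in miniature: a hard maximum over the simplex, f(x) = max_u ⟨x, u⟩ = max_i x_i, is replaced by a version penalized by a KL term centered at a reference distribution u0, which plays the role the expert policy plays in the paper. The penalized problem has the closed form f_μ(x) = μ log(Σ_i u0_i exp(x_i/μ)), which is smooth and recovers f(x) as μ → 0. All quantities below are toy illustrations, not the paper's occupancy-measure formulation.

```python
# Dual smoothing sketch: KL-prox-smoothed max over the simplex, centered at u0.
import numpy as np

def smoothed_max(x, u0, mu):
    # Weighted log-sum-exp, shifted by the max for numerical stability.
    z = x / mu + np.log(u0)
    m = z.max()
    return mu * (m + np.log(np.sum(np.exp(z - m))))

x = np.array([0.3, 1.2, -0.5])
u0 = np.array([0.2, 0.5, 0.3])          # reference ("expert-centered") weights
for mu in (1.0, 0.1, 0.01):
    print(mu, smoothed_max(x, u0, mu))  # tends to max(x) = 1.2 as mu -> 0
```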