Paper Title
Policy Improvement via Imitation of Multiple Oracles
Paper Authors
Paper Abstract
Despite its promise, reinforcement learning's real-world adoption has been hampered by the need for costly exploration to learn a good policy. Imitation learning (IL) mitigates this shortcoming by using an oracle policy during training as a bootstrap to accelerate the learning process. However, in many practical situations, the learner has access to multiple suboptimal oracles, which may provide conflicting advice in a given state. The existing IL literature provides a limited treatment of such scenarios. Whereas in the single-oracle case, the return of the oracle's policy provides an obvious benchmark for the learner to compete against, neither such a benchmark nor principled ways of outperforming it are known for the multi-oracle setting. In this paper, we propose the state-wise maximum of the oracle policies' values as a natural baseline to resolve conflicting advice from multiple oracles. Using a reduction of policy optimization to online learning, we introduce a novel IL algorithm, MAMBA, which can provably learn a policy competitive with this benchmark. In particular, MAMBA optimizes policies by using a gradient estimator in the style of generalized advantage estimation (GAE). Our theoretical analysis shows that this design makes MAMBA robust and enables it to outperform the oracle policies by a larger margin than the IL state of the art, even in the single-oracle case. In an evaluation against standard policy gradient with GAE and AggreVaTe(D), we showcase MAMBA's ability to leverage demonstrations both from a single and from multiple weak oracles, significantly speeding up policy optimization.
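To make the proposed benchmark concrete, below is a minimal Python sketch (not the authors' implementation) of the two ingredients the abstract names: the state-wise max baseline f(s) = max_k V^k(s) over the oracle value functions, and a lambda-weighted, GAE-style advantage estimate computed against that baseline. The function names are hypothetical, and the sketch assumes per-oracle value estimates are available as callables.

```python
import numpy as np

def max_baseline(state, oracle_values):
    """State-wise maximum over oracle value functions: f(s) = max_k V^k(s)."""
    return max(v(state) for v in oracle_values)

def gae_style_advantages(rewards, baseline_vals, gamma=0.99, lam=0.95):
    """Lambda-weighted advantages along one trajectory, computed against a
    fixed baseline f (here, the max over oracle values), in the style of
    generalized advantage estimation (GAE):
        A_t = sum_{l >= 0} (gamma * lam)^l * delta_{t+l},
        delta_t = r_t + gamma * f(s_{t+1}) - f(s_t).
    `baseline_vals[t]` holds f(s_t) and has length T + 1 for T transitions.
    """
    T = len(rewards)
    deltas = [rewards[t] + gamma * baseline_vals[t + 1] - baseline_vals[t]
              for t in range(T)]
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):  # standard backward GAE recursion
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Illustrative usage with two hypothetical (weak) oracle value functions.
oracles = [lambda s: -abs(s - 3.0), lambda s: -0.5 * abs(s - 1.0)]
states = [0.0, 1.0, 2.0, 3.0]   # T = 3 transitions, so T + 1 states
rewards = [0.1, 0.2, 0.3]
f = [max_baseline(s, oracles) for s in states]  # f[t] = max_k V^k(s_t)
print(gae_style_advantages(rewards, f))
```

In the single-oracle case the max baseline reduces to that oracle's value function, so this estimator specializes to an AggreVaTe(D)-style advantage; with multiple oracles, taking the state-wise max lets the learner follow whichever oracle is best from the current state.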