Paper Title
PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning
Paper Authors
Paper Abstract
Direct policy gradient methods for reinforcement learning are a successful approach for a variety of reasons: they are model-free, they directly optimize the performance metric of interest, and they allow for richly parameterized policies. Their primary drawback is that, by being local in nature, they fail to adequately explore the environment. In contrast, while model-based approaches and Q-learning directly handle exploration through the use of optimism, their ability to handle model misspecification and function approximation is far less evident. This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover). PC-PG enjoys polynomial sample complexity and run time for both tabular MDPs and, more generally, linear MDPs in an infinite dimensional RKHS. Furthermore, PC-PG also has strong guarantees under model misspecification that go beyond the standard worst case $\ell_{\infty}$ assumptions; this includes approximation guarantees for state aggregation under an average case error assumption, along with guarantees under a more general assumption where the approximation error under distribution shift is controlled. We complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
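To make the policy-cover idea concrete, below is a minimal, self-contained sketch in Python of the high-level loop the abstract describes: maintain an ensemble of previously learned policies (the cover), use their visitation statistics to build an exploration bonus, and run policy gradient on the bonus-augmented reward. It is only an illustration under simplifying assumptions, not the authors' implementation: a toy tabular chain MDP with one-hot features (so the elliptical bonus reduces to an inverse-count bonus), vanilla REINFORCE in place of the natural policy gradient used in the paper, and hypothetical names and hyperparameters throughout (`step`, `rollout`, `pc_pg`, `bonus_scale`, etc.).

```python
import numpy as np

# Toy deterministic chain MDP: only persistent "right" actions reach the single
# rewarding state, so a purely local policy gradient tends to under-explore.
# (Hypothetical sizes chosen for illustration.)
N_STATES, N_ACTIONS, HORIZON = 8, 2, 16

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rollout(theta, bonus=None, rng=None):
    """Sample one episode under the tabular softmax policy theta[s, a]."""
    rng = rng or np.random.default_rng()
    s, traj = 0, []
    for _ in range(HORIZON):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s_next, r = step(s, a)
        if bonus is not None:
            r = r + bonus[s, a]  # optimism via the exploration bonus
        traj.append((s, a, r))
        s = s_next
    return traj

def reinforce_update(theta, traj, lr=0.5):
    """One vanilla REINFORCE step (a simplification of the NPG step in the paper)."""
    returns = np.cumsum([r for _, _, r in traj][::-1])[::-1]  # undiscounted returns-to-go
    grad = np.zeros_like(theta)
    for (s, a, _), G in zip(traj, returns):
        probs = softmax(theta[s])
        grad[s] -= probs * G      # grad of log softmax policy ...
        grad[s, a] += G           # ... is (one-hot(a) - probs), weighted by G
    return theta + lr * grad

def pc_pg(num_epochs=10, pg_iters=200, cover_rollouts=200,
          lam=1.0, bonus_scale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    cover = [np.zeros((N_STATES, N_ACTIONS))]  # start the cover with the uniform policy
    for _ in range(num_epochs):
        # 1) Estimate state-action visitation under the policy cover
        #    (uniform mixture over all previously learned policies).
        counts = np.zeros((N_STATES, N_ACTIONS))
        for _ in range(cover_rollouts):
            theta = cover[rng.integers(len(cover))]
            for s, a, _ in rollout(theta, rng=rng):
                counts[s, a] += 1
        counts /= cover_rollouts
        # 2) Elliptical exploration bonus; with one-hot features the covariance
        #    matrix is diagonal, so the bonus is an inverse-count expression.
        bonus = bonus_scale / np.sqrt(lam + counts)
        # 3) Run policy gradient on the bonus-augmented reward.
        theta = np.zeros((N_STATES, N_ACTIONS))
        for _ in range(pg_iters):
            theta = reinforce_update(theta, rollout(theta, bonus, rng))
        # 4) Grow the policy cover with the newly learned policy.
        cover.append(theta)
    return cover

if __name__ == "__main__":
    cover = pc_pg()
    # Evaluate the final policy on the true (bonus-free) reward.
    avg = np.mean([sum(r for _, _, r in rollout(cover[-1])) for _ in range(100)])
    print(f"average true return of final policy: {avg:.2f}")
```

The key design point this sketch illustrates is that exploration is driven by the whole cover rather than by the current policy alone: the bonus shrinks wherever some earlier policy already visits, which pushes each new policy toward under-covered state-action pairs.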