Paper Title
Phasic Policy Gradient
Paper Authors
Paper Abstract
We introduce Phasic Policy Gradient (PPG), a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases. In prior methods, one must choose between using a shared network or separate networks to represent the policy and value function. Using separate networks avoids interference between objectives, while using a shared network allows useful features to be shared. PPG is able to achieve the best of both worlds by splitting optimization into two phases, one that advances training and one that distills features. PPG also enables the value function to be more aggressively optimized with a higher level of sample reuse. Compared to PPO, we find that PPG significantly improves sample efficiency on the challenging Procgen Benchmark.
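To make the two-phase structure described in the abstract concrete, here is a minimal, hypothetical sketch of a phasic training loop in PyTorch. It is not the paper's implementation: the network classes, the synthetic "rollout" data, and the hyperparameter names (N_PI, E_AUX, BETA_CLONE, etc.) are illustrative assumptions. It only shows the key idea: a policy phase of on-policy PPO-style updates with a separate value network, followed by an auxiliary phase that reuses the phase's data many times to distill value features into the policy trunk while a KL term keeps the policy itself unchanged.

```python
# Hypothetical sketch of a phasic (policy phase / auxiliary phase) training loop.
# Names, sizes, and hyperparameters are illustrative, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """Shared trunk with a policy head and an auxiliary value head."""
    def __init__(self, obs_dim=8, n_actions=4, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.aux_value_head = nn.Linear(hidden, 1)  # optimized only in the auxiliary phase

    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy_head(h), self.aux_value_head(h).squeeze(-1)

# A separate value network avoids interference with the policy during the policy phase.
policy_net = PolicyValueNet()
value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
opt_pi = torch.optim.Adam(policy_net.parameters(), lr=3e-4)
opt_v = torch.optim.Adam(value_net.parameters(), lr=3e-4)

def ppo_clip_loss(logits, actions, old_logp, adv, clip=0.2):
    # Standard clipped surrogate objective used during the policy phase.
    logp = torch.distributions.Categorical(logits=logits).log_prob(actions)
    ratio = torch.exp(logp - old_logp)
    return -torch.min(ratio * adv, torch.clamp(ratio, 1 - clip, 1 + clip) * adv).mean()

N_PI, E_PI, E_V, E_AUX, BETA_CLONE = 4, 1, 1, 6, 1.0  # illustrative values

for phase in range(3):  # alternate policy phase and auxiliary phase
    buffer = []
    # ---- Policy phase: N_PI iterations of on-policy updates ----
    for _ in range(N_PI):
        # Placeholder data; a real implementation collects environment rollouts here.
        obs = torch.randn(256, 8)
        actions = torch.randint(0, 4, (256,))
        adv = torch.randn(256)
        returns = torch.randn(256)
        with torch.no_grad():
            logits, _ = policy_net(obs)
            old_logp = torch.distributions.Categorical(logits=logits).log_prob(actions)
        for _ in range(E_PI):  # few epochs on the policy to stay on-policy
            logits, _ = policy_net(obs)
            loss_pi = ppo_clip_loss(logits, actions, old_logp, adv)
            opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()
        for _ in range(E_V):
            loss_v = F.mse_loss(value_net(obs).squeeze(-1), returns)
            opt_v.zero_grad(); loss_v.backward(); opt_v.step()
        buffer.append((obs, returns))
    # ---- Auxiliary phase: distill value features into the policy trunk, ----
    # ---- with a cloning (KL) term that keeps the policy distribution fixed ----
    with torch.no_grad():
        old_logits = [policy_net(obs)[0] for obs, _ in buffer]
    for _ in range(E_AUX):  # high sample reuse: many epochs over the whole phase buffer
        for (obs, returns), old in zip(buffer, old_logits):
            logits, aux_v = policy_net(obs)
            kl = F.kl_div(F.log_softmax(logits, -1), F.softmax(old, -1),
                          reduction="batchmean")
            loss_aux = F.mse_loss(aux_v, returns) + BETA_CLONE * kl
            opt_pi.zero_grad(); loss_aux.backward(); opt_pi.step()
            loss_v = F.mse_loss(value_net(obs).squeeze(-1), returns)
            opt_v.zero_grad(); loss_v.backward(); opt_v.step()
```

The point of the sketch is the asymmetry the abstract describes: the policy is updated conservatively with fresh on-policy data, while the value-related objectives are optimized far more aggressively (more epochs, more sample reuse) in the auxiliary phase, where interference with the policy is controlled by the cloning term rather than by limiting reuse.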