Paper Title

Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO

Paper Authors

Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, Aleksander Madry

Paper Abstract

We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms: Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO). Specifically, we investigate the consequences of "code-level optimizations:" algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm. Seemingly of secondary importance, such optimizations turn out to have a major impact on agent behavior. Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function. These insights show the difficulty and importance of attributing performance gains in deep reinforcement learning. Code for reproducing our results is available at https://github.com/MadryLab/implementation-matters .
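To make the abstract's notion of a "code-level optimization" concrete, here is a minimal sketch, assuming a PyTorch setup; the function name, tensor names, and the shared clip coefficient are illustrative choices, not the authors' implementation. It places PPO's clipped surrogate objective (the core algorithm) next to value-function clipping, one of the implementation-only augmentations the paper examines, so the distinction between the two is visible in code.

# Minimal sketch (not the authors' code) of PPO's clipped policy objective
# plus one example "code-level optimization": clipping the value-function
# update the same way the policy ratio is clipped. Names and the clip
# coefficient are illustrative assumptions, not values from the paper.
import torch


def ppo_losses(log_probs, old_log_probs, advantages,
               values, old_values, returns, clip_eps=0.2):
    """Return PPO's clipped policy loss and a clipped value loss."""
    # Core PPO objective: clip the probability ratio to keep updates "proximal".
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Code-level optimization: also limit how far the value prediction may move
    # from its value at data-collection time before taking the squared error.
    values_clipped = old_values + torch.clamp(values - old_values,
                                              -clip_eps, clip_eps)
    value_loss = torch.max((values - returns) ** 2,
                           (values_clipped - returns) ** 2).mean()
    return policy_loss, value_loss


if __name__ == "__main__":
    # Tiny smoke test with random data.
    n = 8
    log_probs = torch.randn(n, requires_grad=True)
    old_log_probs = log_probs.detach() + 0.05 * torch.randn(n)
    advantages = torch.randn(n)
    values = torch.randn(n, requires_grad=True)
    old_values = values.detach() + 0.05 * torch.randn(n)
    returns = torch.randn(n)
    pl, vl = ppo_losses(log_probs, old_log_probs, advantages,
                        values, old_values, returns)
    print(pl.item(), vl.item())

In released PPO code, tweaks of this kind (value clipping, reward scaling, learning-rate annealing, observation normalization) typically appear together, and it is exactly this class of augmentations whose effect the paper isolates from the core algorithm.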
