Paper Title
Proximal Deterministic Policy Gradient
Paper Authors
Paper Abstract
This paper introduces two simple techniques to improve off-policy Reinforcement Learning (RL) algorithms. First, we formulate off-policy RL as a stochastic proximal point iteration: the target network plays the role of the optimization variable, and the value network computes the proximal operator. Second, we exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action-value estimate through bootstrapping, with only a limited increase in computational cost. Further, we demonstrate significant performance improvements over state-of-the-art algorithms on standard continuous-control RL benchmarks.
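As a rough sketch of the proximal point view described in the abstract (the notation below is ours, not taken from the paper): if $\ell(\theta)$ denotes the stochastic value-fitting loss and $\bar{\theta}_k$ the target-network parameters at iteration $k$, the target update can be read as a stochastic proximal point step,

\[
\bar{\theta}_{k+1} \;=\; \operatorname{prox}_{\lambda \ell}\!\big(\bar{\theta}_k\big)
\;=\; \arg\min_{\theta}\; \ell(\theta) \;+\; \frac{1}{2\lambda}\,\big\lVert \theta - \bar{\theta}_k \big\rVert^{2},
\]

where the inner minimization is the proximal operator (computed by the value network, per the abstract) and the step-size parameter $\lambda$ controls how far the target is allowed to move from its previous value.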