Paper Title
Proximal Deterministic Policy Gradient
Paper Authors
Paper Abstract
This paper introduces two simple techniques to improve off-policy Reinforcement Learning (RL) algorithms. First, we formulate off-policy RL as a stochastic proximal point iteration: the target network plays the role of the optimization variable, and the value network computes the proximal operator. Second, we exploit the two value functions commonly employed in state-of-the-art off-policy algorithms to provide an improved action-value estimate through bootstrapping, with only a limited increase in computational cost. Further, we demonstrate significant performance improvements over state-of-the-art algorithms on standard continuous-control RL benchmarks.
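As a rough sketch of the proximal point view described in the abstract (the notation below is ours, not taken from the paper): if $\ell(\theta)$ denotes the stochastic value-fitting loss and $\bar{\theta}_k$ the target-network parameters at iteration $k$, the target update can be read as a stochastic proximal point step,

\[
\bar{\theta}_{k+1} \;=\; \operatorname{prox}_{\lambda \ell}\!\big(\bar{\theta}_k\big)
\;=\; \arg\min_{\theta}\; \ell(\theta) \;+\; \frac{1}{2\lambda}\,\big\lVert \theta - \bar{\theta}_k \big\rVert^{2},
\]

where the inner minimization is the proximal operator (computed by the value network, per the abstract) and the step-size parameter $\lambda$ controls how far the target is allowed to move from its previous value.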