Paper Title

Regularly Updated Deterministic Policy Gradient Algorithm

Paper Authors

Shuai Han, Wenbo Zhou, Shuai Lü, Jiayu Yu

Paper Abstract

The Deep Deterministic Policy Gradient (DDPG) algorithm is one of the most well-known reinforcement learning methods. However, this method is inefficient and unstable in practical applications. On the other hand, the bias and variance of the Q estimation in the target function are sometimes difficult to control. This paper proposes a Regularly Updated Deterministic (RUD) policy gradient algorithm for these problems. This paper theoretically proves that the learning procedure with RUD can make better use of new data in the replay buffer than the traditional procedure. In addition, the lower variance of the Q value in RUD is better suited to the current Clipped Double Q-learning strategy. This paper designs a comparison experiment against previous methods, an ablation experiment against the original DDPG, and other analytical experiments in MuJoCo environments. The experimental results demonstrate the effectiveness and superiority of RUD.
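
For context on the Clipped Double Q-learning strategy the abstract refers to (introduced in TD3): the critic target is built from the element-wise minimum of two target Q estimates, which curbs overestimation bias but whose downward bias grows with the variance of those estimates; this is why a lower Q-value variance, as the abstract claims for RUD, pairs well with it. The sketch below is a minimal, illustrative NumPy version of that target computation only; the callables q1_target, q2_target, and policy_target, and all hyperparameter values, are hypothetical placeholders, and this is not the paper's RUD procedure.

```python
import numpy as np

def clipped_double_q_target(rewards, next_states, dones,
                            q1_target, q2_target, policy_target,
                            gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Clipped Double Q-learning critic target (as in TD3); illustrative sketch.

    q1_target, q2_target, policy_target are hypothetical placeholder callables:
    the critics map (states, actions) to a batch of scalar values, the policy
    maps states to a batch of actions in [-1, 1].
    """
    # Target policy smoothing: perturb the target action with clipped noise.
    next_actions = policy_target(next_states)
    noise = np.clip(np.random.normal(0.0, noise_std, size=next_actions.shape),
                    -noise_clip, noise_clip)
    next_actions = np.clip(next_actions + noise, -1.0, 1.0)

    # Clipped Double Q: take the minimum of the two target critics to
    # counteract overestimation in the bootstrapped value.
    next_q = np.minimum(q1_target(next_states, next_actions),
                        q2_target(next_states, next_actions))

    # One-step bootstrapped target; terminal transitions get no bootstrap term.
    return rewards + gamma * (1.0 - dones) * next_q


# Toy usage with linear placeholder networks (purely illustrative).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    states = rng.normal(size=(4, 3))
    rewards = rng.normal(size=4)
    dones = np.zeros(4)
    q1 = lambda s, a: (s.sum(axis=1) + a.sum(axis=1)) * 0.1
    q2 = lambda s, a: (s.sum(axis=1) - a.sum(axis=1)) * 0.1
    pi = lambda s: np.tanh(s[:, :2])
    print(clipped_double_q_target(rewards, states, dones, q1, q2, pi))
```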
