Paper Title
Softmax Deep Double Deterministic Policy Gradients
Paper Authors
Paper Abstract
A widely-used actor-critic reinforcement learning algorithm for continuous control, Deep Deterministic Policy Gradients (DDPG), suffers from the overestimation problem, which can negatively affect performance. Although the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm mitigates the overestimation issue, it can lead to a large underestimation bias. In this paper, we propose to use the Boltzmann softmax operator for value function estimation in continuous control. We first theoretically analyze the softmax operator in continuous action spaces. Then, we uncover an important property of the softmax operator in actor-critic algorithms, i.e., it helps to smooth the optimization landscape, which sheds new light on the benefits of the operator. We also design two new algorithms, Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double Deterministic Policy Gradients (SD3), by building the softmax operator upon single and double estimators, which can effectively mitigate the overestimation and underestimation biases. We conduct extensive experiments on challenging continuous control tasks, and the results show that SD3 outperforms state-of-the-art methods.
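For context, the Boltzmann softmax operator referred to above is commonly defined over a continuous action space as follows; the notation, with an inverse-temperature parameter \beta, is a standard formulation and not quoted from this abstract:

\mathrm{softmax}_{\beta}(Q)(s) = \frac{\int_{\mathcal{A}} e^{\beta Q(s,a)}\, Q(s,a)\, da}{\int_{\mathcal{A}} e^{\beta Q(s,a)}\, da}

Under this definition, the operator approaches the max operator of standard value estimation as \beta \to \infty and the mean over actions as \beta \to 0, which suggests how tuning \beta can trade off between the overestimation and underestimation biases discussed in the abstract.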