Paper title
Soft policy optimization using dual-track advantage estimator
Paper authors
Paper abstract
In reinforcement learning (RL), we expect the agent to explore as many states as possible in the early stage of training and to exploit the gathered information in later stages to discover the trajectory with the highest return. Based on this principle, in this paper we soften proximal policy optimization by introducing an entropy term and dynamically setting its temperature coefficient to balance exploration and exploitation. While maximizing the expected reward, the agent also seeks alternative trajectories to avoid converging to a locally optimal policy. However, the increased randomness induced by the entropy term slows training in the early stage. By integrating the temporal-difference (TD) method and the generalized advantage estimator (GAE), we propose the dual-track advantage estimator (DTAE) to accelerate the convergence of the value function and further enhance the performance of the algorithm. Compared with other on-policy RL algorithms on MuJoCo environments, the proposed method not only significantly speeds up training but also achieves state-of-the-art cumulative returns.
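The abstract does not give the exact DTAE update or the temperature schedule, so the following is only a minimal NumPy sketch of the ingredients it names: one-step TD residuals, GAE, a hypothetical fixed-coefficient blend of the two "tracks", and a clipped PPO surrogate softened by an entropy bonus with temperature `alpha`. The function names and the `mix` coefficient are illustrative assumptions, not the paper's method.

```python
import numpy as np

def td_advantages(rewards, values, gamma=0.99):
    """One-step TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    next_values = np.append(values[1:], 0.0)  # bootstrap with 0 at episode end
    return rewards + gamma * next_values - values

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation: discounted sum of TD residuals."""
    deltas = td_advantages(rewards, values, gamma)
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

def dual_track_advantages(rewards, values, gamma=0.99, lam=0.95, mix=0.5):
    """Hypothetical blend of the TD and GAE tracks.

    The paper's DTAE formula is not stated in the abstract; this simply
    mixes the two estimates with a fixed coefficient `mix` for illustration.
    """
    td = td_advantages(rewards, values, gamma)
    gae = gae_advantages(rewards, values, gamma, lam)
    return mix * td + (1.0 - mix) * gae

def soft_ppo_loss(ratio, advantage, entropy, alpha, clip_eps=0.2):
    """Clipped PPO surrogate plus an entropy bonus scaled by temperature alpha."""
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantage, clipped * advantage)
    return -(surrogate + alpha * entropy).mean()
```

In an actual implementation, `alpha` would be adjusted dynamically over training (high early on to encourage exploration, decayed later to favor exploitation), as the abstract describes; the schedule itself is not specified there.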