Paper title
Adaptive and Multiple Time-scale Eligibility Traces for Online Deep Reinforcement Learning
Paper authors
Paper abstract
Deep reinforcement learning (DRL) is a promising approach to teaching robots to perform complex tasks. Robotic problems often involve time-varying environments, and methods that directly reuse stored experience data cannot follow such changes, so online DRL is required. The eligibility traces method is a well-known online learning technique for improving sample efficiency, but it has been established for traditional reinforcement learning with linear regressors rather than for DRL. The dependency between the parameters of deep neural networks would destroy the eligibility traces, which is why they have not been integrated with DRL. Although replacing the gradient with the most influential one, rather than accumulating the gradients as the eligibility traces, can alleviate this problem, the replacing operation reduces the number of times previous experiences are reused. To address these issues, this study proposes a new eligibility traces method that can be used even in DRL while maintaining high sample efficiency. When the accumulated gradients differ from those computed using the latest parameters, the proposed method takes the divergence between the past and latest parameters into account to adaptively decay the eligibility traces. Because computing the divergence between the past and latest parameters directly is computationally infeasible, Bregman divergences between the outputs computed with the past and latest parameters are exploited instead. In addition, a generalized method with multiple time-scale traces is designed for the first time. This design allows for the replacement of the most influential adaptively accumulated (decayed) eligibility traces.
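For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of classic TD(lambda) with accumulating eligibility traces for a linear value function. It is not the paper's algorithm. The function names, the `decay_scale` argument, and the exponential mapping in `adaptive_decay_scale` are hypothetical illustrations of how an output divergence could shrink the trace. The squared error is used only because it is one simple instance of a Bregman divergence.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def td_lambda_step(w, z, phi, phi_next, reward,
                   gamma=0.9, lam=0.8, alpha=0.1, decay_scale=1.0):
    """One online TD(lambda) update for a linear value v(s) = w . phi(s).

    decay_scale is a hypothetical extra factor in [0, 1] (an assumption,
    not taken from the paper); 1.0 recovers the standard accumulating trace.
    """
    # Accumulating trace: decay the old trace, then add the current
    # gradient, which is simply phi for a linear value function.
    z = [gamma * lam * decay_scale * zi + p for zi, p in zip(z, phi)]
    # Temporal-difference error under the current weights.
    delta = reward + gamma * dot(w, phi_next) - dot(w, phi)
    # Every weight is credited in proportion to its trace.
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w, z


def adaptive_decay_scale(old_output, new_output, sensitivity=1.0):
    """Hypothetical mapping from an output divergence to a factor in (0, 1].

    Uses 0.5 * (a - b)^2, the Bregman divergence generated by 0.5 * x^2.
    """
    divergence = 0.5 * (old_output - new_output) ** 2
    return math.exp(-sensitivity * divergence)


# Usage: two transitions with one-hot features and a reward of 1 each.
w, z = [0.0, 0.0], [0.0, 0.0]
w, z = td_lambda_step(w, z, [1.0, 0.0], [0.0, 1.0], reward=1.0)
w, z = td_lambda_step(w, z, [0.0, 1.0], [0.0, 0.0], reward=1.0)
```

In the second step the first weight is updated again even though its feature is inactive, because its trace is still nonzero. This is the reuse of previous experiences that the abstract refers to. Passing a `decay_scale` below 1.0 shrinks that carried-over credit.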