Paper Title

t-Soft Update of Target Network for Deep Reinforcement Learning

Paper Authors

Taisuke Kobayashi, Wendyam Eric Lionel Ilboudo

Paper Abstract

This paper proposes a new robust update rule of the target network for deep reinforcement learning (DRL), to replace the conventional update rule given as an exponential moving average. The target network smoothly generates the reference signals for the main network in DRL, thereby reducing learning variance. The problem with the conventional update rule is that all the parameters are smoothly copied from the main network at the same speed, even when some of them are trying to update in the wrong direction. This behavior increases the risk of generating wrong reference signals. Although slowing down the overall update speed is a naive way to mitigate wrong updates, it would decrease the learning speed. To robustly update the parameters while keeping the learning speed, a t-soft update method, inspired by the student-t distribution, is derived with reference to the analogy between the exponential moving average and the normal distribution. Through the analysis of the derived t-soft update, we show that it inherits the properties of the student-t distribution. Specifically, owing to the heavy-tailed property of the student-t distribution, the t-soft update automatically excludes extreme updates that differ from past experiences. In addition, when the updates are similar to past experiences, it can mitigate the learning delay by increasing the amount of update. In PyBullet robotics simulations for DRL, an online actor-critic algorithm with the t-soft update outperformed the conventional methods in terms of the obtained return and/or its variance. From the training process with the t-soft update, we found that the t-soft update is globally consistent with the standard soft update, and the update rates are locally adjusted for acceleration or suppression.
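The abstract contrasts the conventional soft (exponential-moving-average) update with a student-t-inspired rule that suppresses extreme parameter copies and mildly accelerates ordinary ones. The sketch below illustrates that idea only; it is not the paper's exact algorithm, and the names (`TSoftStyleUpdater`, `nu`, `sigma_sq`, `tau_eff`) as well as the specific running-scale bookkeeping are assumptions made for this illustration.

```python
# Illustrative sketch (not the paper's exact rule): standard soft (EMA)
# target update versus a student-t-style adaptive update that down-weights
# deviations far outside past experience. `nu` and `sigma_sq` are assumed
# names for the degrees of freedom and the running squared-deviation scale.
import numpy as np


def soft_update(target, main, tau=0.01):
    """Conventional soft update: EMA with a fixed rate tau."""
    return (1.0 - tau) * target + tau * main


class TSoftStyleUpdater:
    """Student-t-inspired update: the effective rate shrinks for deviations
    that are extreme relative to the tracked scale (heavy-tailed exclusion),
    and rises slightly above tau when deviations match past experience."""

    def __init__(self, dim, tau=0.01, nu=1.0, sigma_sq_init=1e-2):
        self.tau = tau          # nominal update rate
        self.nu = nu            # degrees of freedom: smaller -> heavier tails
        self.sigma_sq = np.full(dim, sigma_sq_init)  # running deviation scale

    def update(self, target, main):
        delta = main - target
        # Student-t style weight: ~1 for typical deviations, small for outliers.
        w = (self.nu + 1.0) / (self.nu + delta**2 / (self.sigma_sq + 1e-12))
        # Effective per-parameter rate; slightly above tau when w > 1.
        tau_eff = self.tau * w / (1.0 - self.tau + self.tau * w)
        new_target = (1.0 - tau_eff) * target + tau_eff * main
        # Track deviation scale so "extreme" is defined by past experience.
        self.sigma_sq = (1.0 - self.tau) * self.sigma_sq + self.tau * delta**2
        return new_target


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target_soft = np.zeros(4)
    target_tsoft = np.zeros(4)
    updater = TSoftStyleUpdater(dim=4, tau=0.1)
    for step in range(5):
        main = rng.normal(0.0, 0.1, size=4)
        if step == 3:
            main[0] = 10.0  # an extreme, likely-wrong update
        target_soft = soft_update(target_soft, main, tau=0.1)
        target_tsoft = updater.update(target_tsoft, main)
        print(step, np.round(target_soft, 3), np.round(target_tsoft, 3))
```

In this toy run, the plain soft update copies a large fraction of the injected outlier into the target, while the t-soft-style rate collapses toward zero for that one parameter and stays near (or slightly above) tau for the others, mirroring the acceleration/suppression behavior described in the abstract.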
