Paper Title
Evolutionary Stochastic Policy Distillation
Paper Authors
Paper Abstract
Solving Goal-Conditioned Reward Sparse (GCRS) tasks is a challenging reinforcement learning problem due to the sparsity of the reward signal. In this work, we propose a new formulation of GCRS tasks from the perspective of a drifted random walk on the state space, and design a novel method called Evolutionary Stochastic Policy Distillation (ESPD) to solve them based on the insight of reducing the First Hitting Time of the stochastic process. As a self-imitation approach, ESPD enables a target policy to learn from a series of its stochastic variants through the technique of policy distillation (PD). The learning mechanism of ESPD can be considered an Evolution Strategy (ES) that applies perturbations to the policy directly in the action space, with a SELECT function to check the superiority of the stochastic variants, and then uses PD to update the policy. Experiments on the MuJoCo robotics control suite demonstrate the high learning efficiency of the proposed method.
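The abstract's perturb-SELECT-distill loop can be illustrated with a minimal sketch. This is not the paper's implementation: the 1-D point environment, the linear policy `a = -w*s`, the noise scale, and the learning rate are all illustrative assumptions. It only mirrors the stated mechanism: sample stochastic variants of the policy by adding action-space noise, SELECT the variants whose rollouts reach the goal region with a smaller first hitting time than the deterministic policy, and distill the target policy toward the selected variants' actions by regression.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(w, s0=1.0, sigma=0.0, horizon=50, eps=0.05):
    """Roll out the linear policy a = -w*s (optionally perturbed with
    Gaussian action noise) on a 1-D point environment with goal 0.
    Returns (first hitting time of the goal region, (state, action) pairs)."""
    s, pairs = s0, []
    for t in range(horizon):
        a = -w * s + (sigma * rng.normal() if sigma else 0.0)
        a = float(np.clip(a, -0.5, 0.5))  # bounded actions
        pairs.append((s, a))
        s = s + a
        if abs(s) < eps:
            return t + 1, pairs
    return horizon + 1, pairs  # never reached the goal region

def espd(w=0.1, iters=30, k=8, sigma=0.2, lr=0.1):
    """Toy ESPD loop (illustrative hyper-parameters, not the paper's)."""
    for _ in range(iters):
        t_target, _ = rollout(w, sigma=0.0)
        batch = []
        for _ in range(k):                 # k stochastic variants of the policy
            t_var, pairs = rollout(w, sigma=sigma)
            if t_var < t_target:           # SELECT: keep faster-hitting variants
                batch.extend(pairs)
        for s, a in batch:                 # PD: regress policy onto variant actions
            # gradient of 0.5 * ((-w*s) - a)^2 with respect to w
            w -= lr * (-s) * ((-w * s) - a)
    return w
```

The SELECT step compares first hitting times rather than rewards, which is exactly what makes the approach applicable to sparse-reward goal-conditioned tasks: a variant only needs to reach the goal sooner, not to collect denser reward along the way.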