Paper Title

Cyclic Policy Distillation: Sample-Efficient Sim-to-Real Reinforcement Learning with Domain Randomization

Paper Authors

Yuki Kadokawa, Lingwei Zhu, Yoshihisa Tsurumine, Takamitsu Matsubara

Paper Abstract

Deep reinforcement learning with domain randomization learns a control policy in various simulations with randomized physical and sensor model parameters to become transferable to the real world in a zero-shot setting. However, a huge number of samples are often required to learn an effective policy when the range of randomized parameters is extensive due to the instability of policy updates. To alleviate this problem, we propose a sample-efficient method named cyclic policy distillation (CPD). CPD divides the range of randomized parameters into several small sub-domains and assigns a local policy to each one. Then local policies are learned while cyclically transitioning to sub-domains. CPD accelerates learning through knowledge transfer based on expected performance improvements. Finally, all of the learned local policies are distilled into a global policy for sim-to-real transfers. CPD's effectiveness and sample efficiency are demonstrated through simulations with four tasks (Pendulum from OpenAI Gym and Pusher, Swimmer, and HalfCheetah from MuJoCo), and a real-robot ball-dispersal task. We published code and videos from our experiments at https://github.com/yuki-kadokawa/cyclic-policy-distillation.
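
To make the training flow the abstract describes more concrete, below is a minimal Python sketch of the cyclic loop: splitting the randomization range into sub-domains, assigning a local policy to each, cycling through them with improvement-weighted knowledge transfer, and finally distilling into a global policy. This is only an illustration under assumptions: the class and method names (SubDomainEnv, LocalPolicy, transfer_from, expected_return) and the simplified transfer rule are hypothetical placeholders, not the authors' API; see the linked repository for the actual implementation.

```python
# Illustrative sketch of the CPD training loop (placeholder names, not the authors' code).
import numpy as np

N_SUBDOMAINS = 4          # number of small sub-domains the randomized range is split into
N_CYCLES = 10             # number of passes over all sub-domains
STEPS_PER_VISIT = 10_000  # environment steps spent in a sub-domain per visit


class SubDomainEnv:
    """Placeholder simulator whose randomized physical/sensor parameters are
    sampled only from one sub-range of the full randomization range."""

    def __init__(self, low, high):
        self.low, self.high = low, high  # bounds of this sub-domain


class LocalPolicy:
    """Placeholder for an off-the-shelf RL agent (e.g., SAC) assigned to one
    sub-domain, plus a value estimate used to gauge expected improvement."""

    def train(self, env, steps):
        pass  # run the RL algorithm inside `env` for `steps` environment steps

    def expected_return(self, env):
        return 0.0  # estimate of this policy's return in `env`

    def transfer_from(self, other, weight):
        pass  # blend in the neighboring policy's knowledge with strength `weight`


# Split the full randomization range [0, 1] into equal sub-domains and
# assign one local policy to each.
edges = np.linspace(0.0, 1.0, N_SUBDOMAINS + 1)
envs = [SubDomainEnv(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]
policies = [LocalPolicy() for _ in range(N_SUBDOMAINS)]

for cycle in range(N_CYCLES):
    for i in range(N_SUBDOMAINS):  # cyclically visit the sub-domains
        neighbor = policies[(i - 1) % N_SUBDOMAINS]

        # Knowledge transfer: weight the neighbor's contribution by how much
        # it is expected to improve performance in this sub-domain.
        gain = neighbor.expected_return(envs[i]) - policies[i].expected_return(envs[i])
        policies[i].transfer_from(neighbor, weight=float(np.clip(gain, 0.0, 1.0)))

        # Continue learning the local policy inside its own sub-domain.
        policies[i].train(envs[i], STEPS_PER_VISIT)

# Finally, all local policies would be distilled into one global policy
# (e.g., supervised regression onto their actions) for sim-to-real transfer.
```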
