Paper Title

Auto-Agent-Distiller: Towards Efficient Deep Reinforcement Learning Agents via Neural Architecture Search

Paper Authors

Yonggan Fu, Zhongzhi Yu, Yongan Zhang, Yingyan Celine Lin

Abstract

AlphaGo's astonishing performance has ignited explosive interest in developing deep reinforcement learning (DRL) for numerous real-world applications, such as intelligent robotics. However, the often prohibitive complexity of DRL stands at odds with the real-time control and constrained resources required in many DRL applications, limiting the great potential of DRL-powered intelligent devices. While substantial efforts have been devoted to compressing other deep learning models, existing works barely touch the surface of compressing DRL. In this work, we first identify that there exists an optimal model size for DRL that maximizes both the test scores and efficiency, motivating the need for task-specific DRL agents. We therefore propose an Auto-Agent-Distiller (A2D) framework, which, to the best of our knowledge, is the first to apply neural architecture search (NAS) to DRL, automatically searching for optimal DRL agents that maximize both the test scores and efficiency across various tasks. Specifically, we demonstrate that vanilla NAS can easily fail to find optimal agents due to the high variance it induces in DRL training stability, and we then develop a novel distillation mechanism that distills knowledge from both the teacher agent's actor and critic to stabilize the searching process and improve the searched agents' optimality. Extensive experiments and ablation studies consistently validate our findings and the advantages and general applicability of A2D, which outperforms manually designed DRL agents in both the test scores and efficiency. All code will be released upon acceptance.
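The abstract states that A2D distills from both the teacher agent's actor and critic. The paper's exact loss is not given here, so the sketch below is only a plausible illustration: a KL divergence between the teacher's and student's action distributions (actor term) plus a weighted squared error between their value estimates (critic term). The function name `actor_critic_distill_loss` and the weight `lam` are assumptions, not from the paper.

```python
import numpy as np

def actor_critic_distill_loss(teacher_logits, student_logits,
                              teacher_value, student_value, lam=1.0):
    """Hypothetical A2D-style distillation objective (a sketch, not the
    paper's actual loss): KL(teacher policy || student policy) for the
    actor, plus lam-weighted MSE between value estimates for the critic."""
    def softmax(x):
        e = np.exp(x - x.max())  # subtract max for numerical stability
        return e / e.sum()

    p_t = softmax(np.asarray(teacher_logits, dtype=float))
    p_s = softmax(np.asarray(student_logits, dtype=float))
    # Actor term: KL divergence between the two action distributions.
    kl = float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))
    # Critic term: squared error between scalar value estimates.
    mse = float((teacher_value - student_value) ** 2)
    return kl + lam * mse
```

With identical logits and values the loss is zero; any disagreement in either the policy or the value estimate increases it, which is the property a distillation signal needs to stabilize the student's search.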
