Paper Title
Automated Design Space Exploration for optimised Deployment of DNN on Arm Cortex-A CPUs
Paper Authors
Paper Abstract
The spread of deep learning on embedded devices has prompted the development of numerous methods to optimise the deployment of deep neural networks (DNNs). Works have mainly focused on: i) efficient DNN architectures, ii) network optimisation techniques such as pruning and quantisation, iii) optimised algorithms to speed up the execution of the most computationally intensive layers and iv) dedicated hardware to accelerate the data flow and computation. However, there is a lack of research on cross-level optimisation, as the space of approaches becomes too large to test and obtain a globally optimised solution. This leads to suboptimal deployment in terms of latency, accuracy, and memory. In this work, we first detail and analyse the methods to improve the deployment of DNNs across the different levels of software optimisation. Building on this knowledge, we present an automated exploration framework to ease the deployment of DNNs. The framework relies on a Reinforcement Learning search that, combined with a deep learning inference framework, automatically explores the design space and learns an optimised solution that speeds up inference and reduces the memory footprint on embedded CPU platforms. We present a set of results for state-of-the-art DNNs on a range of Arm Cortex-A CPU platforms, achieving up to 4x improvement in performance and over 2x reduction in memory, with negligible loss in accuracy with respect to the BLAS floating-point implementation.
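As a rough illustration of the kind of search loop the abstract describes (the paper's actual agent, reward function, and action space are not given here, so the per-layer choices, cost model, and all numbers below are hypothetical placeholders for real on-device measurements), a design-space exploration over per-layer deployment options can be sketched as an epsilon-greedy search that iteratively mutates the current best configuration:

```python
import random

# Hypothetical per-layer design choices: numeric precision and convolution
# algorithm. Illustrative only, not the paper's actual action space.
CHOICES = [("fp32", "gemm"), ("int8", "gemm"), ("int8", "winograd")]

def measure(config):
    """Mock cost model standing in for on-device latency/memory measurement.

    Assumed (made-up) effects: int8 halves both latency and memory per layer,
    and a Winograd convolution cuts latency by a further 30%.
    Returns a single scalar cost (lower is better).
    """
    latency = sum((0.5 if prec == "int8" else 1.0) *
                  (0.7 if algo == "winograd" else 1.0)
                  for prec, algo in config)
    memory = sum(0.5 if prec == "int8" else 1.0 for prec, _ in config)
    return latency + memory

def explore(num_layers=4, episodes=1000, eps=0.2, seed=0):
    """Epsilon-greedy exploration over the per-layer configuration space."""
    rng = random.Random(seed)
    best = [rng.choice(CHOICES) for _ in range(num_layers)]
    best_cost = measure(best)
    for _ in range(episodes):
        # With probability eps, mutate one randomly chosen layer of the
        # incumbent solution; otherwise re-evaluate the incumbent unchanged.
        cand = list(best)
        i = rng.randrange(num_layers)
        if rng.random() < eps:
            cand[i] = rng.choice(CHOICES)
        cost = measure(cand)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost
```

A real framework would replace `measure` with profiled latency and memory on the target Cortex-A CPU (and an accuracy check after quantisation), and a learned policy such as REINFORCE would replace the uniform-random mutation, but the explore/measure/update structure is the same.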