Paper Title

DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs

Authors

Alessio Burrello, Angelo Garofalo, Nazareno Bruschi, Giuseppe Tagliavini, Davide Rossi, Francesco Conti

Abstract

The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme edge of the Internet-of-Things is a critical enabler to support pervasive Deep Learning-enhanced applications. Low-cost MCU-based end-nodes have limited on-chip memory and often replace caches with scratchpads to reduce area overheads and increase energy efficiency, which requires explicit DMA-based memory transfers between different levels of the memory hierarchy. Mapping modern DNNs on these systems requires aggressive topology-dependent tiling and double-buffering. In this work, we propose DORY (Deployment Oriented to memoRY), an automatic tool to deploy DNNs on low-cost MCUs with typically less than 1 MB of on-chip SRAM memory. DORY abstracts tiling as a Constraint Programming (CP) problem: it maximizes L1 memory utilization under the topological constraints imposed by each DNN layer. Then, it generates ANSI C code to orchestrate off- and on-chip transfers and computation phases. Furthermore, to maximize speed, DORY augments the CP formulation with heuristics promoting performance-effective tile sizes. As a case study for DORY, we target GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low power MCU-class devices on the market. On this device, DORY achieves up to 2.5x better MAC/cycle than the GreenWaves proprietary software solution and 18.1x better than the state-of-the-art result on an STM32-F746 MCU on single layers. Using our tool, GAP8 can perform end-to-end inference of a 1.0-MobileNet-128 network consuming just 63 pJ/MAC on average at 4.3 fps, 15.4x better than an STM32-F746. We release all our developments (the DORY framework, the optimized backend kernels, and the related heuristics) as open-source software.
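To make the tiling idea in the abstract concrete, the sketch below searches for the convolution tile that fills the largest share of an L1 budget while leaving room for double-buffering. It is an illustration only, not DORY's implementation: it uses exhaustive search in place of a Constraint Programming solver, and the function names (`tile_bytes`, `best_tile`), the int8 data type, the unpadded convolution, the choice to keep all input channels in every tile, and the example layer sizes are all assumptions made for this example.

```python
from itertools import product


def tile_bytes(c_out_t, h_out_t, w_out_t, c_in, k, stride=1):
    """Bytes one tile needs in L1, assuming int8 data (1 byte per element)
    and an unpadded convolution that keeps all input channels in the tile."""
    # Input rows/columns required to produce the requested output rows/columns.
    h_in_t = (h_out_t - 1) * stride + k
    w_in_t = (w_out_t - 1) * stride + k
    in_bytes = c_in * h_in_t * w_in_t
    out_bytes = c_out_t * h_out_t * w_out_t
    weight_bytes = c_out_t * c_in * k * k
    return in_bytes + out_bytes + weight_bytes


def best_tile(c_out, h_out, w_out, c_in, k, l1_budget, stride=1):
    """Return ((c_out_t, h_out_t, w_out_t), bytes_used) for the tile that
    maximizes L1 usage, or (None, 0) if no tile fits the budget."""
    best, best_used = None, 0
    for c_t, h_t, w_t in product(range(1, c_out + 1),
                                 range(1, h_out + 1),
                                 range(1, w_out + 1)):
        # Factor 2: double-buffering keeps two copies of every buffer so that
        # DMA transfers for the next tile overlap with compute on this one.
        used = 2 * tile_bytes(c_t, h_t, w_t, c_in, k, stride)
        if used <= l1_budget and used > best_used:
            best, best_used = (c_t, h_t, w_t), used
    return best, best_used


if __name__ == "__main__":
    # Hypothetical layer: 8 output channels, 8x8 output, 4 input channels, 3x3 kernel.
    tile, used = best_tile(c_out=8, h_out=8, w_out=8, c_in=4, k=3, l1_budget=2048)
    print("tile (c_out, h_out, w_out):", tile, "bytes used:", used)
```

The abstract's performance heuristics would show up here as extra terms in the objective, for example a preference for tile sizes that match the kernel library's inner-loop unrolling. The sketch omits them because the abstract does not specify their form.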
