Paper Title
Dynamic DNNs Meet Runtime Resource Management on Mobile and Embedded Platforms
Paper Authors
Paper Abstract
Deep neural network (DNN) inference is increasingly executed on mobile and embedded platforms due to its lower latency and better privacy. However, efficient deployment on these platforms is challenging because of their intensive computation and memory access. We propose a holistic system design for DNN performance and energy optimisation that combines the trade-off opportunities in both algorithms and hardware. The system can be viewed as three abstract layers: the device layer contains heterogeneous computing resources; the application layer has multiple concurrent workloads; and the runtime resource management layer monitors the dynamically changing performance targets of the algorithms, as well as the hardware resources and constraints, and tries to meet them by tuning the algorithm and the hardware at the same time. Moreover, we illustrate the runtime approach through a dynamic version of the 'once-for-all network' (namely Dynamic-OFA), which can scale the ConvNet architecture to fit heterogeneous computing resources efficiently and generalises well to different model architectures such as Transformers. Compared to state-of-the-art dynamic DNNs, our experimental results using ImageNet on a Jetson Xavier NX show that Dynamic-OFA is up to 3.5x (CPU) and 2.4x (GPU) faster at similar ImageNet Top-1 accuracy, or 3.8% (CPU) and 5.1% (GPU) more accurate at similar latency. Furthermore, compared with the Linux governors (e.g. performance, schedutil), our runtime approach reduces energy consumption by 16.5% at similar latency.
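To make the runtime resource management layer concrete, below is a minimal sketch of the kind of decision logic the abstract implies: given an offline-profiled lookup table of (sub-network, frequency) configurations, pick the most accurate one whose latency meets the current target, tuning the algorithm knob (which Dynamic-OFA sub-network to activate) and the hardware knob (device frequency) together. Everything here, including the `Config` fields, `choose_config`, and the profiled numbers, is a hypothetical illustration under these assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the runtime resource-management layer: jointly pick a
# (sub-network, frequency) pair that meets a latency target while maximising
# accuracy. All identifiers and numbers are illustrative placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    subnet: str        # Dynamic-OFA sub-network identifier (algorithm knob)
    freq_mhz: int      # device frequency (hardware knob, e.g. set via DVFS)
    latency_ms: float  # offline-profiled latency for this pair
    accuracy: float    # offline-profiled ImageNet Top-1 accuracy
    energy_mj: float   # offline-profiled energy per inference

# Offline-profiled lookup table (made-up numbers for illustration).
PROFILE = [
    Config("subnet-S", 1100, 18.0, 0.76, 95.0),
    Config("subnet-M", 1100, 31.0, 0.79, 150.0),
    Config("subnet-M", 1900, 20.0, 0.79, 210.0),
    Config("subnet-L", 1900, 34.0, 0.81, 320.0),
]

def choose_config(latency_target_ms: float) -> Config | None:
    """Return the most accurate configuration whose profiled latency meets the
    target, breaking ties in favour of lower energy; None if infeasible."""
    feasible = [c for c in PROFILE if c.latency_ms <= latency_target_ms]
    if not feasible:
        return None  # target infeasible: caller must relax it or degrade
    return max(feasible, key=lambda c: (c.accuracy, -c.energy_mj))

if __name__ == "__main__":
    print(choose_config(25.0))  # -> subnet-M @ 1900 MHz in this toy table
    print(choose_config(19.0))  # -> subnet-S @ 1100 MHz
```

In the full system, this selection would presumably be re-run whenever the performance targets or resource availability change, for instance when a new concurrent workload arrives at the application layer, and the chosen frequency would be applied to the hardware rather than assumed fixed as it is here.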