Paper Title
Semi-Streaming Architecture: A New Design Paradigm for CNN Implementation on FPGAs
Authors
Abstract
The recent research advances in deep learning have led to the development of small and powerful Convolutional Neural Network (CNN) architectures. Meanwhile, Field Programmable Gate Arrays (FPGAs) have become a popular hardware target for their deployment, with implementations splitting into two main categories: streaming hardware architectures and single computation engine designs. Streaming hardware architectures generally require implementing every layer as a discrete processing unit, and are suitable for smaller software models whose unfolded versions can fit into resource-constrained targets. On the other hand, a single computation engine can be scaled to fit into a device to execute CNN models of different sizes and complexities; however, the achievable performance of such one-size-fits-all implementations may vary across CNNs with different workload attributes, leading to inefficient utilization of hardware resources. By combining the advantages of both approaches, this work proposes a new design paradigm called the semi-streaming architecture, in which layer-specialized configurable engines are used for network realization. As a proof of concept, this paper presents a set of five layer-specialized configurable processing engines for implementing an 8-bit quantized MobileNetV2 CNN model. The engines are chained to partially preserve data streaming and are tuned individually to efficiently process specific types of layers: normalized addition of residuals, depthwise, pointwise (expansion and projection), and standard 2D convolution layers, capable of delivering 5.4GOp/s, 16GOp/s, 27.2GOp/s, 27.2GOp/s, and 89.6GOp/s, respectively, with an overall energy efficiency of 5.32GOp/s/W at a 100MHz system clock and a total power of 6.2W on an XCZU7EV SoC FPGA.
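The per-engine throughput, power, and efficiency figures above are self-consistent, which a short arithmetic sketch can make explicit. This is an illustrative check, not code from the paper; the dictionary keys are descriptive names chosen here, not identifiers used by the authors.

```python
# Peak throughput per layer-specialized engine in GOp/s, as reported in the abstract.
engine_throughput_gops = {
    "residual_add": 5.4,          # normalized addition of residuals
    "depthwise": 16.0,            # depthwise convolution
    "pointwise_expansion": 27.2,  # pointwise (expansion) convolution
    "pointwise_projection": 27.2, # pointwise (projection) convolution
    "standard_conv2d": 89.6,      # standard 2D convolution
}

total_power_w = 6.2       # reported total power on the XCZU7EV SoC FPGA
efficiency_gops_per_w = 5.32  # reported overall energy efficiency

# Overall sustained throughput implied by the reported efficiency and power:
# 5.32 GOp/s/W * 6.2 W ≈ 33 GOp/s.
implied_throughput = efficiency_gops_per_w * total_power_w
print(f"Implied overall throughput: {implied_throughput:.1f} GOp/s")
```

Note that the implied overall throughput (about 33 GOp/s) sits between the slowest and fastest engine peaks, as expected for a chained pipeline whose sustained rate is bounded by its slower stages.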