Paper Title

Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration

Paper Authors

Zhi-Gang Liu, Paul N. Whatmough, Matthew Mattina

Paper Abstract

Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead. In this paper, we address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware. We describe a time-unrolled formulation of variable density-bound block (VDBB) sparsity that allows for a configurable number of non-zero elements per block, at constant utilization. We then describe a systolic array microarchitecture that implements this scheme, with two data reuse optimizations. First, we increase reuse of both operands and partial products by increasing the number of MACs per PE. Second, we introduce a novel approach of moving the IM2COL transform into the hardware, which allows us to achieve a 3x data bandwidth expansion just before the operands are consumed by the datapath, reducing the SRAM power consumption. The optimizations for weight sparsity, activation sparsity, and data reuse are all interrelated, so the optimal combination is not obvious. Therefore, we perform a design space evaluation to find the Pareto-optimal design characteristics. The resulting design achieves 16.8 TOPS/W in 16nm with a modest 50% model sparsity and scales with model sparsity up to 55.7 TOPS/W at 87.5%. As well as successfully demonstrating the variable DBB technique, this result significantly outperforms previously reported sparse CNN accelerators.
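To make the two ideas named in the abstract concrete, below is a minimal NumPy sketch of (a) density-bound block pruning, where each weight block retains at most a fixed number of non-zero elements, and (b) the IM2COL unrolling that turns convolution into a GEMM. This is only an illustration of the concepts, not the paper's hardware scheme or metadata format; the block size, non-zero bound, and function names (prune_dbb, im2col) are assumptions chosen for the example.

```python
import numpy as np

def prune_dbb(weights, block_size=8, max_nonzero=4):
    """Density-bound block pruning (illustrative): within each block of
    `block_size` weights, keep only the `max_nonzero` largest-magnitude
    entries and zero the rest, giving a predictable per-block sparsity."""
    w = weights.copy()
    assert w.size % block_size == 0
    blocks = w.reshape(-1, block_size)
    for blk in blocks:
        # indices of the smallest-magnitude entries in this block to drop
        drop = np.argsort(np.abs(blk))[:block_size - max_nonzero]
        blk[drop] = 0
    return blocks.reshape(-1)

def im2col(x, kh, kw):
    """Unroll sliding kh x kw patches of a single-channel feature map into
    columns (stride 1, no padding), so convolution with flattened filters
    becomes a plain GEMM."""
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.integers(-128, 127, size=32).astype(np.int32)   # INT8-range weights
    w_sparse = prune_dbb(w, block_size=8, max_nonzero=4)     # 50% per-block sparsity
    print("non-zeros per block:",
          (w_sparse.reshape(-1, 8) != 0).sum(axis=1))

    x = rng.integers(-128, 127, size=(6, 6)).astype(np.int32)
    print("im2col shape:", im2col(x, 3, 3).shape)            # (9, 16)
```

In the paper's design the IM2COL expansion happens in hardware just before the datapath, which is what yields the bandwidth expansion relative to reading pre-unrolled data from SRAM; the software sketch above only shows the data transformation itself.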
