Paper Title

MARS: Multi-macro Architecture SRAM CIM-Based Accelerator with Co-designed Compressed Neural Networks

Authors

Syuan-Hao Sie, Jye-Luen Lee, Yi-Ren Chen, Chih-Cheng Lu, Chih-Cheng Hsieh, Meng-Fan Chang, Kea-Tiong Tang

Abstract

Convolutional neural networks (CNNs) play a key role in deep learning applications. However, the large storage overheads and the substantial computation cost of CNNs are problematic in hardware accelerators. Computing-in-memory (CIM) architecture has demonstrated great potential to effectively compute large-scale matrix-vector multiplication. However, the intensive multiply and accumulate (MAC) operations executed at the crossbar array and the limited capacity of CIM macros remain bottlenecks for further improvement of energy efficiency and throughput. To reduce computation costs, network pruning and quantization are two widely studied compression methods to shrink the model size. However, most model compression algorithms can only be implemented in digital-based CNN accelerators. For implementation in a static random access memory (SRAM) CIM-based accelerator, the model compression algorithm must consider the hardware limitations of CIM macros, such as the number of word lines and bit lines that can be turned on at the same time, as well as how the weights are mapped to the SRAM CIM macro. In this study, a software and hardware co-design approach is proposed to design an SRAM CIM-based CNN accelerator and an SRAM CIM-aware model compression algorithm. To lessen the high-precision MAC operations required by batch normalization (BN), a quantization algorithm that can fuse BN into the weights is proposed. Furthermore, to reduce the number of network parameters, a sparsity algorithm that considers the CIM architecture is proposed. Finally, MARS, a CIM-based CNN accelerator that can utilize multiple SRAM CIM macros as processing units and support sparse neural networks, is proposed.
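The abstract mentions fusing BN into the weights so that no separate high-precision MAC is needed at inference. The paper's exact quantization scheme is not given here, but the standard BN-folding identity it builds on can be sketched as follows (a minimal NumPy sketch for a linear layer; the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def fold_bn_into_weights(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding linear layer.

    W: (out_ch, in_ch) weights; b: (out_ch,) bias.
    gamma/beta/mean/var: (out_ch,) BN affine parameters and running stats.
    Returns (W_fused, b_fused) such that
        BN(x @ W.T + b) == x @ W_fused.T + b_fused
    so the BN multiply-accumulate disappears at inference time.
    """
    scale = gamma / np.sqrt(var + eps)                      # per-channel scale
    W_fused = W * scale.reshape(-1, *([1] * (W.ndim - 1)))  # scale each output row
    b_fused = (b - mean) * scale + beta                     # absorb shift into bias
    return W_fused, b_fused
```

After folding, the fused weights can be quantized and mapped to the CIM macro directly, which is the motivation stated in the abstract.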
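The abstract also describes a sparsity algorithm that respects the CIM architecture, e.g. the word-line/bit-line granularity of a macro. The paper's actual algorithm is not reproduced here; the sketch below only illustrates the general idea of group-wise magnitude pruning, where weights are removed in contiguous groups sized to a macro segment so surviving weights still map onto whole word-line groups (function name, `group_size`, and `keep_ratio` are assumptions for illustration):

```python
import numpy as np

def prune_by_cim_groups(W, group_size, keep_ratio):
    """Group-wise magnitude pruning aligned to a CIM macro segment.

    W: (out_ch, in_ch) weights, in_ch divisible by group_size.
    Zeroes entire contiguous groups of `group_size` input weights per
    output row, keeping the top `keep_ratio` fraction of groups by
    summed magnitude, so the sparsity pattern maps onto whole
    word-line segments of the macro.
    """
    out_ch, in_ch = W.shape
    groups = W.reshape(out_ch, in_ch // group_size, group_size)
    scores = np.abs(groups).sum(axis=-1)                 # importance per group
    k = max(1, int(round(keep_ratio * scores.shape[1]))) # groups kept per row
    idx = np.argsort(-scores, axis=1)[:, :k]             # top-k group indices
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return (groups * mask[..., None]).reshape(out_ch, in_ch)
```

Structured (group-level) rather than element-level sparsity is what lets a CIM accelerator skip entire word-line activations, which is why the abstract stresses that the compression must be architecture-aware.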
