CARLA: A Convolution Accelerator with a Reconfigurable and Low-Energy Architecture

Paper Title

CARLA: A Convolution Accelerator with a Reconfigurable and Low-Energy Architecture

Authors

Mehdi Ahmadi, Shervin Vakili, J. M. Pierre Langlois

Abstract

Convolutional Neural Networks (CNNs) have proven to be extremely accurate for image recognition, even outperforming human recognition capability. When deployed on battery-powered mobile devices, efficient computer architectures are required to enable fast and energy-efficient computation of costly convolution operations. Despite recent advances in hardware accelerator design for CNNs, two major problems have not yet been addressed effectively, particularly when the convolution layers have highly diverse structures: (1) minimizing energy-hungry off-chip DRAM data movements; (2) maximizing the utilization factor of processing resources to perform convolutions. This work thus proposes an energy-efficient architecture equipped with several optimized dataflows to support the structural diversity of modern CNNs. The proposed approach is evaluated by implementing convolutional layers of VGGNet-16 and ResNet-50. Results show that the architecture achieves a Processing Element (PE) utilization factor of 98% for the majority of 3x3 and 1x1 convolutional layers, while limiting latency to 396.9 ms and 92.7 ms when performing convolutional layers of VGGNet-16 and ResNet-50, respectively. In addition, the proposed architecture benefits from the structured sparsity in ResNet-50 to reduce the latency to 42.5 ms when half of the channels are pruned.
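To make the abstract's metrics concrete, the following is a minimal sketch of how a PE utilization factor and a latency speedup could be computed. It assumes each PE performs one multiply-accumulate (MAC) per cycle; the helper names, the layer shape, the PE count, and the cycle count are all hypothetical and are not taken from the paper. Only the two ResNet-50 latencies (92.7 ms and 42.5 ms) come from the abstract.

```python
def conv_macs(out_h, out_w, in_ch, out_ch, k):
    """MAC count for one k x k convolution layer, given its output size."""
    return out_h * out_w * in_ch * out_ch * k * k


def pe_utilization(macs, num_pes, cycles):
    """Fraction of PE-cycles spent on useful MACs (assumes 1 MAC per PE per cycle)."""
    return macs / (num_pes * cycles)


def speedup(baseline_ms, new_ms):
    """Ratio of baseline latency to new latency."""
    return baseline_ms / new_ms


# Hypothetical 3x3 layer: 56x56 output, 64 input and 64 output channels.
layer_macs = conv_macs(56, 56, 64, 64, 3)

# Hypothetical accelerator: 64 PEs, with 2% of cycles lost to stalls.
num_pes = 64
ideal_cycles = layer_macs // num_pes          # divides evenly for this shape
actual_cycles = int(ideal_cycles / 0.98)      # assumed stall overhead
print(f"utilization: {pe_utilization(layer_macs, num_pes, actual_cycles):.3f}")

# Latencies reported in the abstract for ResNet-50 (dense vs. half the channels pruned).
print(f"speedup from pruning: {speedup(92.7, 42.5):.2f}x")
```

Under this model the MAC count scales linearly with both the input and the output channel count, which is why structured channel pruning translates directly into fewer cycles on an accelerator that keeps its PEs busy.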
