Paper Title

A Fresh Perspective on DNN Accelerators by Performing Holistic Analysis Across Paradigms

Paper Authors

Tom Glint, Chandan Kumar Jha, Manu Awasthi, Joycee Mekie

Paper Abstract

Traditional computers with the von Neumann architecture are unable to meet the latency and scalability challenges of Deep Neural Network (DNN) workloads. Various DNN accelerators based on the Conventional compute Hardware Accelerator (CHA), Near-Data-Processing (NDP), and Processing-in-Memory (PIM) paradigms have been proposed to meet these challenges. Our goal in this work is to perform a rigorous comparison among state-of-the-art accelerators from these DNN accelerator paradigms. For our analysis, we use unique layers from MobileNet, ResNet, BERT, and DLRM of the MLPerf Inference benchmark. The detailed models are based on hardware-realized state-of-the-art designs. We observe that for memory-intensive Fully Connected Layer (FCL) DNNs, the NDP-based accelerator is 10.6x faster than the state-of-the-art CHA and 39.9x faster than the PIM-based accelerator for inference. For compute-intensive image classification and object detection DNNs, the state-of-the-art CHA is ~10x faster than NDP and ~2000x faster than the PIM-based accelerator for inference. PIM-based accelerators are suitable for DNN applications where energy is a constraint (~2.7x and ~21x lower energy for CNN and FCL applications, respectively, than conventional ASIC systems). Further, through a detailed sensitivity analysis of relevant components in CHA-, NDP-, and PIM-based accelerators, we identify architectural changes (such as increasing memory bandwidth and reorganizing buffers) that can increase throughput (up to a linear increase) and lower energy (up to a linear decrease) for ML applications.
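The abstract's core trade-off, that memory-intensive FCL workloads favor bandwidth-rich NDP while compute-intensive CNN workloads favor compute-rich CHA, can be illustrated with a simple roofline model. The sketch below is not from the paper; the peak-compute, bandwidth, and arithmetic-intensity numbers are illustrative assumptions chosen only to show the direction of the effect, not the paper's measured 10.6x/~10x figures.

```python
# Minimal roofline-model sketch (illustrative, not the paper's methodology).
# Attainable performance is capped by either peak compute or by memory
# bandwidth multiplied by arithmetic intensity (FLOPs per byte moved).

def attainable_tflops(peak_tflops: float, bandwidth_tbps: float,
                      arithmetic_intensity: float) -> float:
    """Roofline bound: min(compute roof, bandwidth * FLOPs/byte)."""
    return min(peak_tflops, bandwidth_tbps * arithmetic_intensity)

# Hypothetical paradigm corners: (peak TFLOP/s, memory bandwidth TB/s).
paradigms = {
    "CHA (compute-rich)":   (100.0, 1.0),
    "NDP (bandwidth-rich)": (10.0, 8.0),
}

# FCLs reuse each fetched weight little (low FLOPs/byte);
# conv layers reuse activations and weights heavily (high FLOPs/byte).
layers = {
    "FCL (memory-intensive)":   1.0,
    "Conv (compute-intensive)": 200.0,
}

for layer, intensity in layers.items():
    for name, (peak, bw) in paradigms.items():
        perf = attainable_tflops(peak, bw, intensity)
        print(f"{layer:26s} on {name:22s}: {perf:6.1f} TFLOP/s")
```

With these assumed numbers, the FCL is bandwidth-bound, so NDP's higher memory bandwidth wins, while the conv layer is compute-bound, so CHA's higher compute roof wins. The same model also shows why raising memory bandwidth yields an up-to-linear throughput increase for memory-bound layers, as the abstract's sensitivity analysis reports.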
