Paper Title


Enabling Flexibility for Sparse Tensor Acceleration via Heterogeneity

Authors

Eric Qin, Raveesh Garg, Abhimanyu Bambhaniya, Michael Pellauer, Angshuman Parashar, Sivasankaran Rajamanickam, Cong Hao, Tushar Krishna

Abstract


Recently, numerous sparse hardware accelerators for Deep Neural Networks (DNNs), Graph Neural Networks (GNNs), and scientific computing applications have been proposed. A common characteristic among all of these accelerators is that they target tensor algebra (typically matrix multiplications); yet dozens of new accelerators are proposed for every new application. The motivation is that the size and sparsity of the workloads heavily influence which architecture is best for memory and computation efficiency. To satisfy the growing demand for efficient computation across a spectrum of workloads in large data centers, we propose deploying a flexible 'heterogeneous' accelerator, which contains many 'sub-accelerators' (smaller specialized accelerators) working together. To this end, we propose: (1) HARD TACO, a quick and productive C++-to-RTL design flow to generate many types of sub-accelerators for sparse and dense computations for fair design-space exploration, (2) AESPA, a heterogeneous sparse accelerator design template constructed with the sub-accelerators generated from HARD TACO, and (3) a suite of scheduling strategies to map tensor kernels onto heterogeneous sparse accelerators with high efficiency and utilization. AESPA with optimized scheduling achieves 1.96X higher performance and 7.9X better energy-delay product (EDP) than a homogeneous EIE-like accelerator on our diverse workload suite.
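The core idea of point (3) can be illustrated with a toy sketch: dispatch each tensor kernel to whichever sub-accelerator best matches its sparsity. This is not the paper's actual AESPA scheduler; the class names, sub-accelerator profiles, and sparsity thresholds below are illustrative assumptions only.

```python
from dataclasses import dataclass

@dataclass
class Kernel:
    name: str
    sparsity: float  # fraction of zero elements, in [0.0, 1.0]

# Hypothetical sub-accelerator profiles: each is assumed to be most
# efficient for kernels whose sparsity falls in its preferred range.
SUB_ACCELERATORS = {
    "dense_engine":  (0.0, 0.5),   # dense / low-sparsity kernels
    "sparse_engine": (0.5, 1.0),   # high-sparsity kernels
}

def schedule(kernels):
    """Greedy sparsity-based dispatch: assign each kernel to the first
    sub-accelerator whose preferred range contains its sparsity."""
    assignment = {}
    for k in kernels:
        for accel, (lo, hi) in SUB_ACCELERATORS.items():
            if lo <= k.sparsity <= hi:
                assignment[k.name] = accel
                break
    return assignment

# Example: a dense GEMM goes to the dense engine, a sparse SpMM to the
# sparse engine.
plan = schedule([Kernel("gemm", 0.1), Kernel("spmm", 0.9)])
```

A real scheduler would also weigh kernel size, sub-accelerator availability, and load balance across the heterogeneous array, which is what the paper's suite of scheduling strategies explores.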
