Paper Title
Mapping Stencils on Coarse-grained Reconfigurable Spatial Architecture
Paper Authors
Paper Abstract
Stencils are a class of computational patterns in which each output grid point depends on a fixed-shape neighborhood of points in an input grid. Stencil computations are prevalent in scientific applications and consume a significant portion of supercomputing resources, so optimizing stencil programs for the best performance has always been important. A rich body of research has focused on optimizing stencil computations on almost all parallel architectures. Stencil applications have regular dependency patterns, inherent pipeline parallelism, and plenty of data reuse, which makes them a perfect match for a coarse-grained reconfigurable spatial architecture (CGRA). A CGRA consists of many simple, small processing elements (PEs) connected by an on-chip network. Each PE can be configured to execute part of a stencil computation, and all PEs run in parallel. The network can also be configured so that loaded data is passed directly from a PE to a neighboring PE and is thus reused by many PEs without register spilling or memory traffic. Mapping a stencil computation to a CGRA efficiently is the key to performance. In this paper, we show a few unique and generalizable ways of mapping one-dimensional and multidimensional stencil computations to a CGRA that fully exploit the data reuse opportunities and parallelism. Our simulation experiments demonstrate that these mappings are efficient and enable the CGRA to outperform state-of-the-art GPUs.
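As a rough illustration of the data reuse the abstract describes, the following Python sketch compares a direct one-dimensional stencil with a toy model of a chain of PEs that forward values to their neighbors. It is not the mapping proposed in the paper: the function names, the weight vector `w`, and the shift-register model of the PE chain are all assumptions made for this example.

```python
def stencil_1d_reference(x, w):
    """Direct evaluation over the valid region: out[i] = sum_k w[k] * x[i + k]."""
    r = len(w)
    return [sum(w[k] * x[i + k] for k in range(r)) for i in range(len(x) - r + 1)]


def stencil_1d_pipeline(x, w):
    """Toy model (an assumption, not the paper's design) of a chain of PEs.

    Each input value is loaded from memory once and then handed from PE to PE,
    so every output reuses values already held in the chain.
    """
    r = len(w)
    regs = [0.0] * r   # regs[k] models the value currently held by PE k
    out = []
    loads = 0          # number of values fetched from memory
    for t, v in enumerate(x):
        # Every PE passes its value to its neighbor; the new value enters the chain.
        regs = regs[1:] + [v]
        loads += 1
        if t >= r - 1:  # the chain is full, so one output is produced per step
            out.append(sum(w[k] * regs[k] for k in range(r)))
    return out, loads


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    w = [1, 1, 1]  # hypothetical 3-point stencil weights
    print(stencil_1d_reference(x, w))
    print(stencil_1d_pipeline(x, w))
```

In this model the pipelined version issues one load per input point, whereas the direct version reads each interior point once per stencil tap. That difference is the kind of reuse a configurable PE-to-PE network is meant to capture.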