Mapping Stencils on Coarse-grained Reconfigurable Spatial Architecture

Paper title

Mapping Stencils on Coarse-grained Reconfigurable Spatial Architecture

Authors

Tithi, Jesmin Jahan; Petrini, Fabrizio; Rong, Hongbo; Valentin, Andrei; Ebeling, Carl

Abstract

Stencils represent a class of computational patterns in which an output grid point depends on a fixed shape of neighboring points in an input grid. Stencil computations are prevalent in scientific applications that occupy a significant portion of supercomputing resources. Therefore, it has always been important to optimize stencil programs for the best performance. A rich body of research has focused on optimizing stencil computations on almost all parallel architectures. Stencil applications have regular dependency patterns, inherent pipeline parallelism, and plenty of data reuse. This makes them a perfect match for a coarse-grained reconfigurable spatial architecture (CGRA). A CGRA consists of many simple, small processing elements (PEs) connected by an on-chip network. Each PE can be configured to execute part of a stencil computation, and all PEs run in parallel; the network can also be configured so that loaded data is passed directly from a PE to a neighboring PE and thus reused by many PEs without register spilling or memory traffic. How to efficiently map a stencil computation to a CGRA is the key to performance. In this paper, we show a few unique and generalizable ways of mapping one-dimensional and multidimensional stencil computations to a CGRA, fully exploiting the data reuse opportunities and parallelism. Our simulation experiments demonstrate that these mappings are efficient and enable the CGRA to outperform state-of-the-art GPUs.
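To make the data-reuse idea in the abstract concrete, the following is a minimal sketch in Python. It is a toy software model, not the mapping proposed in the paper: the function names, the one-value-per-PE assumption, and the load counter are all hypothetical and chosen only for illustration. The first function is a plain 1D stencil; the second imitates a chain of PEs in which each input value is loaded once and then forwarded from neighbor to neighbor.

```python
def stencil_1d(inp, weights):
    """Reference 1D stencil: out[i] = sum_k weights[k] * inp[i + k]."""
    r = len(weights)
    return [
        sum(weights[k] * inp[i + k] for k in range(r))
        for i in range(len(inp) - r + 1)
    ]


def stencil_1d_pe_chain(inp, weights):
    """Toy model of a PE chain (an assumption, not the paper's design).

    Each hypothetical PE holds one value. On every step a new input is
    loaded into PE 0 and every PE forwards its old value to the next PE,
    so each input is loaded from memory exactly once.
    """
    n_pe = len(weights)
    held = [0] * n_pe      # value currently held by each PE
    loads = 0              # number of memory loads performed
    out = []
    for x in inp:
        held = [x] + held[:-1]   # forward values along the chain
        loads += 1
        if loads >= n_pe:        # chain is full: one output per step
            # PE k holds the (n_pe - 1 - k)-th point of the current window.
            out.append(
                sum(weights[n_pe - 1 - k] * held[k] for k in range(n_pe))
            )
    return out, loads


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    w = [1, 2, 3]
    print(stencil_1d(data, w))
    print(stencil_1d_pe_chain(data, w))
```

In this sketch the reference version reads three inputs for every output, while the chain version loads each input only once and obtains the remaining operands from neighboring PEs, which is the kind of reuse the abstract attributes to the configurable on-chip network.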
