Wafer-Scale Fast Fourier Transforms

Paper title

Wafer-Scale Fast Fourier Transforms

Authors

Marcelo Orenes-Vera, Ilya Sharapov, Robert Schreiber, Mathias Jacquelin, Philippe Vandermersch, Sharan Chetlur

Abstract

We have implemented fast Fourier transforms for one-, two-, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes an $n^3$ problem with up to $n^2$ PEs. At this point, a PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs an FFT along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the size of the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication due to its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages. This high efficiency on fine-grain communication allows wsFFT to achieve unprecedented levels of parallelism and performance. We analyse in detail computation and communication time, as well as the weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for a 3D FFT of a $512^3$ complex input array using a 512x512 subgrid of the on-wafer PEs. This is the largest ever parallelization for this problem size and the first implementation that breaks the millisecond barrier.
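
The pencil decomposition described in the abstract can be illustrated with a small serial simulation. The sketch below is not the authors' wsFFT code. It assumes an $n \times n$ grid of simulated PEs, each holding one pencil of $n$ complex values, and it assumes a particular order of axes and transposes that the abstract does not specify. The names `fft1d`, `pencil_fft3d`, and `dft3d_naive` are hypothetical helpers introduced only for this illustration.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def pencil_fft3d(a):
    """3D FFT of the cube a[x][y][z] on a simulated n-by-n grid of PEs.

    Assumed layout (illustrative only): PE (i, j) holds one pencil of
    n values, stored here as pe[i][j][k].
    """
    n = len(a)
    # Initial layout: PE (x, y) holds the z-pencil a[x][y][:].
    pe = [[list(a[x][y]) for y in range(n)] for x in range(n)]

    # Superstep 1: every PE transforms its local pencil (z axis).
    pe = [[fft1d(pe[i][j]) for j in range(n)] for i in range(n)]

    # Redistribution along mesh rows: PE (x, y) sends its element z to
    # PE (x, z), one value per destination. PE (x, z) now holds a y-pencil.
    pe = [[[pe[i][k][j] for k in range(n)] for j in range(n)] for i in range(n)]

    # Superstep 2: local FFT along the y axis.
    pe = [[fft1d(pe[i][j]) for j in range(n)] for i in range(n)]

    # Redistribution along mesh columns: PE (x, z) sends its element y to
    # PE (y, z). PE (y, z) now holds an x-pencil.
    pe = [[[pe[k][j][i] for k in range(n)] for j in range(n)] for i in range(n)]

    # Superstep 3: local FFT along the x axis.
    pe = [[fft1d(pe[i][j]) for j in range(n)] for i in range(n)]

    # Gather back into out[x][y][z]; PE (y, z) holds index x.
    return [[[pe[y][z][x] for z in range(n)] for y in range(n)] for x in range(n)]


def dft3d_naive(a):
    """Direct O(n^6) 3D DFT, used only as a reference for small n."""
    n = len(a)
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
    r = range(n)
    return [[[sum(a[x][y][z] * w[(u * x + v * y + t * z) % n]
                  for x in r for y in r for z in r)
              for t in r] for v in r] for u in r]


if __name__ == "__main__":
    n = 4
    cube = [[[complex(x + 2 * y - z, (x * y + z) % 3) for z in range(n)]
             for y in range(n)] for x in range(n)]
    got, ref = pencil_fft3d(cube), dft3d_naive(cube)
    err = max(abs(got[x][y][z] - ref[x][y][z])
              for x in range(n) for y in range(n) for z in range(n))
    print("max abs error vs naive DFT:", err)
```

In each redistribution of this sketch, a PE sends exactly one value to every other PE in its row or column, which matches the single-word messages mentioned in the abstract. For the reported $512^3$ case, the same scheme would use $512^2 = 262{,}144$ PEs with 512 complex values each.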
