Paper Title
The Role of Idle Waves, Desynchronization, and Bottleneck Evasion in the Performance of Parallel Programs
Paper Authors
Abstract
The performance of highly parallel applications on distributed-memory systems is influenced by many factors. Analytic performance modeling techniques aim to provide insight into performance limitations and are often the starting point of optimization efforts. However, coupling analytic models across the system hierarchy (socket, node, network) fails to encompass the intricate interplay between the program code and the hardware, especially when execution and communication bottlenecks are involved. In this paper we investigate the effect of "bottleneck evasion" and how it can lead to automatic overlap of communication overhead with computation. Bottleneck evasion leads to a gradual loss of the initial bulk-synchronous behavior of a parallel code so that its processes become desynchronized. This occurs most prominently in memory-bound programs, which is why we choose memory-bound benchmark and application codes, specifically an MPI-augmented STREAM Triad, sparse matrix-vector multiplication, and a collective-avoiding Chebyshev filter diagonalization code to demonstrate the consequences of desynchronization on two different supercomputing platforms. We investigate the role of idle waves as possible triggers for desynchronization and show the impact of automatic asynchronous communication for a spectrum of code properties and parameters, such as saturation point, matrix structures, domain decomposition, and communication concurrency. Our findings reveal how eliminating synchronization points (such as collective communication or barriers) precipitates performance improvements that go beyond what can be expected by simply subtracting the overhead of the collective from the overall runtime.