Paper Title
Straggler Mitigation through Unequal Error Protection for Distributed Matrix Multiplication
Paper Authors
Paper Abstract
Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time required for computation at the agents is affected by the availability of local resources, giving rise to the "straggler problem" in which the computation results are held back by unresponsive agents. For this problem, linear coding of the matrix sub-blocks can be used to introduce resilience toward straggling. The Parameter Server (PS) utilizes a channel code and distributes the matrices to the workers for multiplication. It then produces an approximation to the desired matrix multiplication using the results of the computations received by a given deadline. In this paper, we propose to employ Unequal Error Protection (UEP) codes to alleviate the straggler problem. The resiliency level of each sub-block is chosen according to its norm, as blocks with larger norms have a greater effect on the result of the matrix multiplication. We validate the effectiveness of our scheme both theoretically and through numerical evaluations. We derive a theoretical characterization of the performance of UEP using random linear codes and compare it to the case of equal error protection. We also apply the proposed coding strategy to the computation of the back-propagation step in the training of a Deep Neural Network (DNN), for which we investigate the fundamental trade-off between precision and the time required for the computations.
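To make the coding idea concrete, the following is a minimal sketch of norm-aware, UEP-style coded distributed matrix multiplication with a random linear code. It is not the paper's exact construction: the block sizes, number of workers, deadline model, norm-to-protection mapping, and least-squares decoding are illustrative assumptions chosen only to show how sub-blocks with larger norms can receive more redundancy and how the PS approximates the product from the results received by the deadline.

# Minimal sketch (illustrative assumptions, not the paper's exact scheme):
# UEP-style coded distributed matrix multiplication with a random linear code,
# where row sub-blocks of A with larger norms are included in coded tasks
# with higher probability, i.e., they receive stronger protection.
import numpy as np

rng = np.random.default_rng(0)

m, n, p = 40, 30, 8          # A is m x n, B is n x p
k = 4                        # number of row sub-blocks of A
num_workers = 8              # one coded task per worker
deadline_returns = 3         # workers that respond before the deadline

A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
blocks = np.split(A, k, axis=0)                  # sub-blocks A_1, ..., A_k

# UEP: map each block's norm to an inclusion probability (heuristic mapping).
norms = np.array([np.linalg.norm(Ai) for Ai in blocks])
incl_prob = 0.5 + 0.5 * norms / norms.max()      # larger norm -> more protection

# Random linear code: generator matrix G (num_workers x k), sparsified so that
# low-norm blocks appear in fewer coded tasks.
G = rng.standard_normal((num_workers, k)) * (rng.random((num_workers, k)) < incl_prob)

# Each worker multiplies its coded block by B.
coded_blocks = [sum(G[j, i] * blocks[i] for i in range(k)) for j in range(num_workers)]
worker_results = [Cj @ B for Cj in coded_blocks]

# Only a subset of workers responds by the deadline; stragglers are ignored.
responded = rng.choice(num_workers, size=deadline_returns, replace=False)
G_sub = G[responded]                                       # deadline_returns x k
Y = np.stack([worker_results[j] for j in responded])       # received computations

# PS decoding: least-squares estimate of the k products A_i B from the
# received random linear combinations, then reassembly of the approximation.
Y_flat = Y.reshape(deadline_returns, -1)
X_flat, *_ = np.linalg.lstsq(G_sub, Y_flat, rcond=None)
AB_hat = np.vstack([x.reshape(m // k, p) for x in X_flat])

err = np.linalg.norm(AB_hat - A @ B) / np.linalg.norm(A @ B)
print(f"relative approximation error with {deadline_returns} returns: {err:.3e}")

With fewer responding workers than sub-blocks, the decoder returns an approximation whose error is dominated by the lightly protected (small-norm) blocks; once enough workers respond, the product is recovered essentially exactly, which is the intuition behind choosing the resiliency level according to the block norms.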