Paper Title


Crossover-SGD: A gossip-based communication in distributed deep learning for alleviating large mini-batch problem and enhancing scalability

Authors

Sangho Yeo, Minho Bae, Minjoong Jeong, Oh-kyoung Kwon, Sangyoon Oh

Abstract


Distributed deep learning is an effective way to reduce the training time of deep learning for large datasets as well as complex models. However, the limited scalability caused by network overheads makes it difficult to synchronize the parameters of all workers. To resolve this problem, gossip-based methods that demonstrate stable scalability regardless of the number of workers have been proposed. However, to use gossip-based methods in general cases, the validation accuracy for a large mini-batch needs to be verified. To verify this, we first empirically study the characteristics of gossip methods in the large mini-batch problem and observe that the gossip methods preserve higher validation accuracy than AllReduce-SGD (Stochastic Gradient Descent) when the batch size is increased and the number of workers is fixed. However, the delayed parameter propagation of the gossip-based models decreases validation accuracy at large node scales. To cope with this problem, we propose Crossover-SGD, which alleviates the delayed propagation of weight parameters via segment-wise communication and a load-balanced random network topology. We also adopt hierarchical communication to limit the number of workers in gossip-based communication methods. To validate the effectiveness of our proposed method, we conduct empirical experiments and observe that our Crossover-SGD shows higher node scalability than SGP (Stochastic Gradient Push).
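To make the gossip idea concrete, the sketch below simulates gossip averaging among workers: in each round, workers are randomly matched into pairs (a load-balanced topology, since every worker talks to exactly one peer per round) and each pair averages its parameters, so no global AllReduce is needed. This is a generic, single-process toy with scalar parameters and hypothetical function names; it is not the authors' Crossover-SGD implementation, which additionally uses segment-wise and hierarchical communication over real weight tensors.

```python
import random

def gossip_round(params, seed=None):
    """One gossip round over a random pairwise matching.

    Each worker holds a scalar parameter; matched pairs replace their
    values with the pair average. Pairwise averaging preserves the
    global mean while shrinking the spread across workers.
    """
    rng = random.Random(seed)
    workers = list(range(len(params)))
    rng.shuffle(workers)  # fresh random matching each round
    new_params = list(params)
    # Pair adjacent workers in the shuffled order; with an odd worker
    # count, the last worker simply skips this round.
    for i in range(0, len(workers) - 1, 2):
        a, b = workers[i], workers[i + 1]
        avg = (params[a] + params[b]) / 2.0
        new_params[a] = new_params[b] = avg
    return new_params

# Toy demo: 8 workers start with divergent parameters (global mean 3.5)
# and converge toward consensus through repeated gossip rounds alone.
params = [float(i) for i in range(8)]
for r in range(50):
    params = gossip_round(params, seed=r)
print(params)
```

Because only pairs communicate in each round, per-round traffic stays constant as the worker count grows, which is the scalability property the abstract contrasts with AllReduce; the trade-off is that information propagates with a delay, the problem Crossover-SGD targets.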
