Paper Title
Enabling Fast and Flexible Distributed Deep Learning with Programmable Switches
Paper Authors
Paper Abstract
Deep learning has been applied across a wide range of areas and has produced major breakthroughs. With ever-increasing model sizes and training data volumes, distributed deep learning has emerged, which uses a cluster of machines to train a model in parallel. Unfortunately, performance is often far from linear speedup due to the communication overhead between cluster nodes. To address this challenge, this paper designs and implements Libra, a network aggregator that leverages in-network computation to optimize communication for distributed DL training in two ways: 1) reducing active connections and 2) aggregating exchanged network packets. We implemented Libra on Intel Tofino switches, customized a lightweight host stack, and integrated it into the open-source training framework PS-lite. Experimental results show that Libra achieves a 1.5x to 4x speedup.
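To make the in-network aggregation idea concrete, below is a minimal Python sketch of the switch-side semantics the abstract describes, not Libra's actual data plane (which runs on the Tofino switch itself). The names `Aggregator`, `GradientPacket`, and `NUM_WORKERS` are illustrative assumptions: the aggregator accumulates per-key gradient chunks from all workers and releases a single aggregated packet once every worker has contributed, so each worker exchanges one packet with the aggregator instead of one per peer.

```python
# Minimal sketch (assumed names, not Libra's real implementation) of
# switch-side gradient aggregation: absorb per-worker packets, emit one
# aggregated packet per key once all workers have contributed.
from dataclasses import dataclass
from typing import Dict, List, Optional

NUM_WORKERS = 4  # assumed cluster size

@dataclass
class GradientPacket:
    key: int              # parameter-slice identifier
    worker_id: int
    values: List[float]   # fixed-size gradient chunk

class Aggregator:
    """Emulates the per-key accumulation a programmable switch would do."""
    def __init__(self) -> None:
        self.partial: Dict[int, List[float]] = {}
        self.count: Dict[int, int] = {}

    def on_packet(self, pkt: GradientPacket) -> Optional[List[float]]:
        acc = self.partial.setdefault(pkt.key, [0.0] * len(pkt.values))
        for i, v in enumerate(pkt.values):
            acc[i] += v  # element-wise accumulate in "switch registers"
        self.count[pkt.key] = self.count.get(pkt.key, 0) + 1
        if self.count[pkt.key] == NUM_WORKERS:
            # Last contribution arrived: release one aggregated packet,
            # which would be multicast back to all workers.
            del self.count[pkt.key]
            return self.partial.pop(pkt.key)
        return None  # packet absorbed; nothing forwarded

# Usage: four workers each push a chunk for key 0; the aggregate is
# released only on the final contribution.
agg = Aggregator()
out = None
for w in range(NUM_WORKERS):
    out = agg.on_packet(GradientPacket(key=0, worker_id=w, values=[1.0, 2.0]))
print(out)  # [4.0, 8.0]
```

Because intermediate packets are absorbed rather than forwarded, the fabric carries one aggregated result per parameter slice instead of one gradient copy per worker, which is the communication reduction the abstract claims.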