Paper Title


EasyScale: Accuracy-consistent Elastic Training for Deep Learning

Paper Authors

Mingzhen Li, Wencong Xiao, Biao Sun, Hanyu Zhao, Hailong Yang, Shiru Ren, Zhongzhi Luan, Xianyan Jia, Yi Liu, Yong Li, Wei Lin, Depei Qian

Paper Abstract


Distributed synchronized GPU training is commonly used for deep learning. The resource constraint of using a fixed number of GPUs makes large-scale training jobs suffer long queuing times for resource allocation and lowers cluster utilization. Adapting to resource elasticity can alleviate this, but it often introduces inconsistent model accuracy due to the lack of capability to decouple the model training procedure from resource allocation. We propose EasyScale, an elastic training system that achieves consistent model accuracy under resource elasticity on both homogeneous and heterogeneous GPUs. EasyScale strictly preserves data-parallel training behaviors, carefully traces consistency-relevant factors, and exploits deep-learning characteristics through its EasyScaleThread abstraction and fast context switching. To utilize heterogeneous clusters, EasyScale dynamically assigns workers using intra-/inter-job schedulers, minimizing load imbalance and maximizing aggregated job throughput. Deployed in an online serving cluster, EasyScale enables training jobs to opportunistically utilize idle GPUs, improving overall cluster utilization by 62.1%.
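The core idea in the abstract is that elasticity should change only how many physical GPUs execute the job, not the logical data-parallel semantics: each logical worker keeps its own consistency-relevant state (RNG state, data-loader position) and is context-switched onto whatever GPUs are currently available. Below is a minimal sketch of that idea, assuming hypothetical names: `EasyScaleThread`, `switch_in`/`switch_out`, and `train_round` here are illustrative only and are not EasyScale's actual API.

```python
# Illustrative sketch only: EasyScaleThread, switch_in/switch_out, and train_round
# are hypothetical names for exposition, not EasyScale's actual implementation.
import random
import torch


class EasyScaleThread:
    """One logical data-parallel worker whose consistency-relevant state
    (RNG state, data-loader position) can be saved and restored, so several
    logical workers can time-share a single physical GPU while each still
    sees the same mini-batches and random numbers as in a fixed-size run."""

    def __init__(self, rank: int, world_size: int, seed: int):
        self.rank = rank                    # logical rank, fixed across elasticity events
        self.world_size = world_size        # logical world size, also fixed
        self.batches_consumed = 0           # data-loader position
        self.py_rng_state = random.Random(seed + rank).getstate()
        self.torch_rng_state = None         # captured after the first switch-out

    def switch_in(self):
        """Restore this logical worker's RNG state before it runs a mini-batch."""
        random.setstate(self.py_rng_state)
        if self.torch_rng_state is not None:
            torch.set_rng_state(self.torch_rng_state)

    def switch_out(self):
        """Save RNG state and advance the data-loader position afterwards."""
        self.py_rng_state = random.getstate()
        self.torch_rng_state = torch.get_rng_state()
        self.batches_consumed += 1


def train_round(threads, model, batches):
    """Multiplex all logical workers on one physical GPU: each runs its own
    mini-batch in turn, and gradients are accumulated with a 1/N scale so the
    result matches what N parallel GPUs would have averaged."""
    model.zero_grad()
    for thread, batch in zip(threads, batches):
        thread.switch_in()
        loss = model(batch).mean()          # placeholder forward pass and loss
        (loss / len(threads)).backward()    # accumulate averaged gradients
        thread.switch_out()
```

Under this sketch, whether the logical workers run on one GPU or on `world_size` GPUs, each of them consumes the same batch sequence with the same random state, which is the property the paper refers to as accuracy-consistent elastic training.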
