Paper Title
Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms
Paper Authors
Paper Abstract
The widely adopted practice is to train deep learning models with specialized hardware accelerators, e.g., GPUs or TPUs, due to their superior performance on linear algebra operations. However, this strategy does not effectively employ the extensive CPU and memory resources -- which are used only for preprocessing, data transfer, and scheduling -- available by default on the accelerated servers. In this paper, we study training algorithms for deep learning on heterogeneous CPU+GPU architectures. Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging. In order to allow for a principled exploration of the design space, we first introduce a generic deep learning framework that exploits the difference in computational power and memory hierarchy between CPU and GPU through asynchronous message passing. Based on insights gained through experimentation with the framework, we design two heterogeneous asynchronous stochastic gradient descent (SGD) algorithms. The first algorithm -- CPU+GPU Hogbatch -- combines small batches on CPU with large batches on GPU in order to maximize the utilization of both resources. However, this generates an unbalanced model update distribution which hinders the statistical convergence. The second algorithm -- Adaptive Hogbatch -- assigns batches with continuously evolving size based on the relative speed of CPU and GPU. This balances the model update ratio at the expense of a customizable decrease in utilization. We show that the implementation of these algorithms in the proposed CPU+GPU framework achieves both faster convergence and higher resource utilization than TensorFlow on several real datasets and on two computing architectures -- an on-premises server and a cloud instance.
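The adaptive batch sizing described in the abstract can be illustrated with a short sketch. The Python code below is a minimal illustration, not the authors' implementation: it assumes that per-device throughput measurements (examples per second) are available and splits a hypothetical global batch in proportion to the relative speed of the CPU and GPU, so that both devices finish an iteration in roughly the same time and contribute model updates at a similar rate. The function and parameter names (split_batch, cpu_examples_per_sec, gpu_examples_per_sec) are illustrative assumptions.

# Minimal sketch (not the paper's implementation): split a global batch
# between a CPU worker and a GPU worker in proportion to their measured
# throughput, so both produce model updates at a comparable rate.
def split_batch(global_batch: int, cpu_examples_per_sec: float,
                gpu_examples_per_sec: float) -> tuple[int, int]:
    """Assign batch sizes proportional to each device's relative speed."""
    total = cpu_examples_per_sec + gpu_examples_per_sec
    cpu_batch = max(1, round(global_batch * cpu_examples_per_sec / total))
    gpu_batch = max(1, global_batch - cpu_batch)
    return cpu_batch, gpu_batch

# Example: a GPU that processes roughly 9x more examples per second than the CPU
cpu_b, gpu_b = split_batch(1024, cpu_examples_per_sec=1_000,
                           gpu_examples_per_sec=9_000)
print(cpu_b, gpu_b)  # 102 examples assigned to the CPU, 922 to the GPU

In this simplified view, re-running split_batch with periodically refreshed throughput measurements corresponds to the "continuously evolving" batch sizes mentioned in the abstract, while fixing a small CPU batch and a large GPU batch regardless of relative speed corresponds to the CPU+GPU Hogbatch setting.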