Paper Title
Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising
Paper Authors
Paper Abstract
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry. To produce personalized CTR predictions, an industry-level CTR prediction model commonly takes as input a high-dimensional (e.g., 100 billion or 1,000 billion features) sparse vector encoded from query keywords, user portraits, etc. As a result, the model requires terabyte-scale parameters to embed the high-dimensional input. A hierarchical distributed GPU parameter server has been proposed to enable GPUs with limited memory to train such massive networks by leveraging CPU main memory and SSDs as secondary storage. We identify two major challenges in the existing GPU training framework for massive-scale ad models and propose a collection of optimizations to tackle them: (a) GPUs, CPUs, and SSDs communicate rapidly with one another during training, and the connections between GPUs and CPUs are non-uniform due to the hardware topology, so the data communication routes should be optimized according to that topology; (b) GPUs in different computing nodes communicate frequently to synchronize parameters, and this communication must be optimized for the distributed system to scale. In this paper, we propose a hardware-aware training workflow that couples the hardware topology into the algorithm design. To reduce the extensive communication between computing nodes, we introduce a $k$-step model merging algorithm for the popular Adam optimizer and provide its convergence rate for non-convex optimization. To the best of our knowledge, this is the first application of a $k$-step adaptive optimization method to industrial-level CTR model training. Numerical results on real-world data confirm that the optimized system design considerably reduces the training time of the massive model, with essentially no loss in accuracy.
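To make the $k$-step model merging idea concrete, below is a minimal sketch (not the paper's implementation) of the general pattern: each worker runs $k$ local Adam steps on its own data stream, then all workers average their parameters, so cross-node communication happens once every $k$ steps instead of every step. The workers are simulated in-process here; a real multi-node system would replace the in-process averaging with a collective all-reduce, and the model, loss, and hyperparameters (`k`, `rounds`, `lr`) are placeholders.

```python
# Hypothetical sketch of k-step local Adam with periodic parameter merging.
import copy
import itertools
import torch

def k_step_merged_adam(base_model, worker_loaders, k=16, rounds=10, lr=1e-3):
    # One replica and one Adam state per simulated worker.
    workers = [copy.deepcopy(base_model) for _ in worker_loaders]
    optimizers = [torch.optim.Adam(w.parameters(), lr=lr) for w in workers]
    streams = [itertools.cycle(loader) for loader in worker_loaders]
    loss_fn = torch.nn.BCEWithLogitsLoss()

    for _ in range(rounds):
        # Local phase: k Adam steps per worker, no cross-worker communication.
        for model, opt, stream in zip(workers, optimizers, streams):
            for _ in range(k):
                x, y = next(stream)
                opt.zero_grad()
                loss_fn(model(x).squeeze(-1), y).backward()
                opt.step()
        # Merge phase: a single parameter-averaging round every k steps
        # (an all-reduce across computing nodes in a real deployment).
        with torch.no_grad():
            for group in zip(*(w.parameters() for w in workers)):
                mean = torch.stack([p.data for p in group]).mean(dim=0)
                for p in group:
                    p.data.copy_(mean)
    return workers[0]  # all replicas hold identical parameters after merging
```

The trade-off this sketch illustrates is the one the abstract describes: larger $k$ cuts synchronization traffic between nodes at the cost of more divergence between local models, which is why the paper's convergence analysis for the non-convex case matters.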