Title

Quantized Adaptive Subgradient Algorithms and Their Applications

Authors

Ke Xu, Jianqiao Wangni, Yifan Zhang, Deheng Ye, Jiaxiang Wu, Peilin Zhao

Abstract

Data explosion and the increase in model size drive the remarkable advances in large-scale machine learning, but they also make model training time-consuming and model storage difficult. Distributed model training offers high computational efficiency and fewer device constraints, yet two main difficulties remain in addressing the above issues. On the one hand, the communication costs of exchanging information, e.g., stochastic gradients among different workers, are a key bottleneck for distributed training efficiency. On the other hand, a model with fewer parameters is easier to store and communicate, but it risks degrading model performance. To balance communication costs, model capacity, and model performance simultaneously, we propose quantized composite mirror descent adaptive subgradient (QCMD adagrad) and quantized regularized dual averaging adaptive subgradient (QRDA adagrad) for distributed training. Specifically, we explore the combination of gradient quantization and model sparsity to reduce the communication cost per iteration in distributed training. An adaptive learning rate matrix is constructed from the quantized gradients to strike a balance among communication costs, accuracy, and model sparsity. Moreover, we show theoretically that a large quantization error introduces extra noise, which affects the convergence and sparsity of the model. Therefore, a threshold quantization strategy with a relatively small error is adopted in QCMD adagrad and QRDA adagrad to improve the signal-to-noise ratio and preserve the sparsity of the model. Both theoretical analysis and empirical results demonstrate the efficacy and efficiency of the proposed algorithms.
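
To make the two ingredients described in the abstract concrete, the sketch below combines a ternary threshold quantizer with a diagonal-adagrad composite mirror descent step under an l1 regularizer. This is a minimal sketch under our own assumptions, not the authors' implementation: the function names (`threshold_quantize`, `qcmd_adagrad_step`), the scale rule (least-squares optimal common magnitude over the retained coordinates), and all hyperparameters (`eta`, `lam`, `delta`) are illustrative choices.

```python
import numpy as np

def threshold_quantize(g, delta):
    """Ternary threshold quantizer (illustrative): coordinates with
    |g_i| < delta are dropped; the rest become +/- s, where s is the
    mean magnitude of the kept coordinates (the least-squares optimal
    common scale for that support)."""
    mask = np.abs(g) >= delta
    if not mask.any():
        return np.zeros_like(g)
    s = np.abs(g[mask]).mean()
    return s * np.sign(g) * mask

def qcmd_adagrad_step(x, g, h, eta=0.1, lam=1e-3, delta=1e-2, eps=1e-8):
    """One composite mirror descent step with a diagonal adagrad
    preconditioner built from quantized gradients; the l1 term is
    handled by coordinate-wise soft-thresholding, which is what keeps
    the iterates sparse. Hyperparameters are illustrative."""
    q = threshold_quantize(g, delta)  # only q is communicated between workers
    h = h + q ** 2                    # accumulate squared quantized gradients
    step = eta / (np.sqrt(h) + eps)   # per-coordinate adaptive step size
    z = x - step * q                  # preconditioned gradient step
    x_new = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return x_new, h

# Toy usage: coordinates with a persistent gradient signal grow, while
# pure-noise coordinates are quantized away and stay exactly zero.
rng = np.random.default_rng(0)
x, h = np.zeros(8), np.zeros(8)
signal = np.array([2.0, -2.0, 0, 0, 0, 0, 0, 0])
for _ in range(200):
    g = signal + 0.1 * rng.normal(size=8)          # noisy stochastic gradient
    x, h = qcmd_adagrad_step(x, g, h, delta=0.5)   # delta chosen to drop noise
print(np.round(x, 3))
```

In this sketch, transmitting the ternary `q` instead of the dense gradient is where the per-iteration communication saving comes from, and the soft-thresholding step is what preserves exact zeros in the model, matching the communication/sparsity trade-off the abstract describes.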
