快照分散的随机梯度跟踪方法

论文标题

快照分散的随机梯度跟踪方法

Snap-Shot Decentralized Stochastic Gradient Tracking Methods

论文作者

Ye, Haishan, Chang, Xiangyu

论文摘要

在分散的优化中，$ m $代理形成网络，仅与邻居进行通信，这在数据所有权，隐私和可扩展性方面具有优势。同时，作为训练大规模机器学习模型的流行分散算法，分散的随机梯度下降（\ texttt {sgd}）方法表明，它们优于集中式同行。分布式随机梯度跟踪〜（\ texttt {dsgt}）〜\ citep {pu2021distribated}由于其适当的理论保证，因此被认为是流行和最先进的分散\ texttt {sgd}方法。但是，\ dsgt〜 \ citep {koloskova2021impreved}的理论分析表明，其迭代复杂性为$ \ tilde {\ natercal {o}} \ left（\ frac {\ frac {\barσ^2}} \frac{\sqrt{L}\barσ}{μ(1 - λ_2(W))^{1/2} C_W \sqrt{\varepsilon} }\right)$, where $W$ is a double stochastic mixing matrix that presents the network topology and $ C_W $ is a parameter that depends on $W$.因此，它表明\ texttt {dsgt}的收敛属性受到通信网络拓扑的严重影响。为了克服\ texttt {dsgt}的弱点，我们求助于快照梯度跟踪技能，并提出了两种新颖的算法。我们进一步证明，在相似的算法结构下，提出的两种算法对通信网络的拓扑更强大，并且与\ dsgt〜相同的通信策略。与\ dsgt相比，它们的迭代复杂性为$ \ Mathcal {o} \ left（\ frac {\barσ^2} {mμ\ varepsilon} + \ frac {\ sqrt {\ sqrt {l} \ right）$和$ \ Mathcal {o} \ left（\ frac {\barσ^2} {mμ\ varepsilon} + \ frac {\ sqrt {\ sqrt {l} \barσ}}} \ right）$减少对网络拓扑的影响（没有$ C_W $）。

In decentralized optimization, $m$ agents form a network and only communicate with their neighbors, which gives advantages in data ownership, privacy, and scalability. At the same time, decentralized stochastic gradient descent (\texttt{SGD}) methods, as popular decentralized algorithms for training large-scale machine learning models, have shown their superiority over centralized counterparts. Distributed stochastic gradient tracking~(\texttt{DSGT})~\citep{pu2021distributed} has been recognized as the popular and state-of-the-art decentralized \texttt{SGD} method due to its proper theoretical guarantees. However, the theoretical analysis of \dsgt~\citep{koloskova2021improved} shows that its iteration complexity is $\tilde{\mathcal{O}} \left(\frac{\barσ^2}{mμ\varepsilon} + \frac{\sqrt{L}\barσ}{μ(1 - λ_2(W))^{1/2} C_W \sqrt{\varepsilon} }\right)$, where $W$ is a double stochastic mixing matrix that presents the network topology and $ C_W $ is a parameter that depends on $W$. Thus, it indicates that the convergence property of \texttt{DSGT} is heavily affected by the topology of the communication network. To overcome the weakness of \texttt{DSGT}, we resort to the snap-shot gradient tracking skill and propose two novel algorithms. We further justify that the proposed two algorithms are more robust to the topology of communication networks under similar algorithmic structures and the same communication strategy to \dsgt~. Compared with \dsgt, their iteration complexity are $\mathcal{O}\left( \frac{\barσ^2}{mμ\varepsilon} + \frac{\sqrt{L}\barσ}{μ(1 - λ_2(W))\sqrt{\varepsilon}} \right)$ and $\mathcal{O}\left( \frac{\barσ^2}{mμ\varepsilon} + \frac{\sqrt{L}\barσ}{μ(1 - λ_2(W))^{1/2}\sqrt{\varepsilon}} \right)$ which reduce the impact on network topology (no $C_W$).

下载PDF全文

下载文献需遵守相关版权规定

论文标题