Paper Title
On the Convergence of mSGD and AdaGrad for Stochastic Optimization
Paper Authors
Paper Abstract
As one of the most fundamental stochastic optimization algorithms, stochastic gradient descent (SGD) has been intensively developed and extensively applied in machine learning over the past decade. Several modified SGD-type algorithms, such as momentum-based SGD (mSGD) and the adaptive gradient algorithm (AdaGrad), outperform SGD in many competitions and applications in terms of convergence rate and accuracy. Despite these empirical successes, the theoretical properties of these algorithms have not been well established, owing to technical difficulties. With this motivation, we focus on convergence analysis of mSGD and AdaGrad for any smooth (possibly non-convex) loss function in stochastic optimization. First, we prove that the iterates of mSGD asymptotically converge to a connected set of stationary points with probability one, which is more general than existing results on subsequence convergence or convergence of time averages. Moreover, we prove that the loss function of mSGD decays at a certain rate faster than that of SGD. In addition, we prove that the iterates of AdaGrad asymptotically converge to a connected set of stationary points with probability one; this result likewise extends results in the literature on subsequence convergence and convergence of time averages. Despite the generality of the above convergence results, we relax some of the assumptions on gradient noise, convexity of the loss function, and boundedness of the iterates.
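For readers unfamiliar with the two algorithms analyzed in the abstract, the following is a minimal sketch of the mSGD (heavy-ball momentum) and AdaGrad update rules on a toy quadratic loss. The step sizes, momentum coefficient, noise model, and the loss itself are illustrative assumptions for this sketch, not the settings analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x):
    # Stochastic gradient of the toy loss f(x) = 0.5 * ||x||^2:
    # the true gradient x plus zero-mean Gaussian noise.
    return x + 0.01 * rng.standard_normal(x.shape)

def msgd(x0, lr=0.1, beta=0.9, steps=500):
    # Momentum-based SGD: v accumulates an exponentially weighted
    # sum of past stochastic gradients (heavy-ball form).
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        v = beta * v + noisy_grad(x)
        x = x - lr * v
    return x

def adagrad(x0, lr=0.5, eps=1e-8, steps=500):
    # AdaGrad: each coordinate's step size is scaled by the inverse
    # square root of its running sum of squared gradients.
    x, g2 = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        g = noisy_grad(x)
        g2 += g * g
        x = x - lr * g / (np.sqrt(g2) + eps)
    return x

x0 = np.array([2.0, -3.0])
print(np.linalg.norm(msgd(x0)), np.linalg.norm(adagrad(x0)))
```

On this strongly convex toy problem both methods drive the iterate toward the unique stationary point at the origin; the paper's results concern the far harder setting of general smooth, possibly non-convex losses, where convergence is to a connected set of stationary points with probability one.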