Paper Title
A Modular Analysis of Provable Acceleration via Polyak's Momentum: Training a Wide ReLU Network and a Deep Linear Network
Paper Authors
Paper Abstract
Incorporating a so-called "momentum" dynamic in gradient descent methods is widely used in neural net training, as it has been broadly observed that, at least empirically, it often leads to significantly faster convergence. At the same time, there are very few theoretical guarantees in the literature to explain this apparent acceleration effect. Even for the classical strongly convex quadratic problems, several existing results only show that Polyak's momentum has an accelerated linear rate asymptotically. In this paper, we first revisit the quadratic problems and show a non-asymptotic accelerated linear rate of Polyak's momentum. Then, we provably show that Polyak's momentum achieves acceleration for training a one-layer wide ReLU network and a deep linear network, which are perhaps the two most popular canonical models for studying optimization and deep learning in the literature. Prior works Du et al. 2019 and Wu et al. 2019 showed that using vanilla gradient descent, and with the use of over-parameterization, the error decays as $(1-Θ(\frac{1}{κ'}))^t$ after $t$ iterations, where $κ'$ is the condition number of a Gram matrix. Our result shows that with the appropriate choice of parameters Polyak's momentum has a rate of $(1-Θ(\frac{1}{\sqrt{κ'}}))^t$. For the deep linear network, prior work Hu et al. 2020 showed that vanilla gradient descent has a rate of $(1-Θ(\frac{1}{κ}))^t$, where $κ$ is the condition number of a data matrix. Our result shows that an accelerated rate $(1-Θ(\frac{1}{\sqrt{κ}}))^t$ is achievable by Polyak's momentum. All the results in this work are obtained from a modular analysis, which can be of independent interest. This work establishes that momentum does indeed speed up neural net training.
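For intuition about the acceleration discussed above, the following is a minimal sketch, not taken from the paper, of Polyak's heavy-ball update $w_{t+1} = w_t - \eta \nabla f(w_t) + \beta (w_t - w_{t-1})$ on a strongly convex quadratic, compared against vanilla gradient descent. The step size $\eta$ and momentum $\beta$ are the classical choices for quadratics with smoothness $L$ and strong convexity $\mu$; the problem dimensions, variable names, and iteration count are illustrative assumptions.

```python
# Minimal sketch: Polyak's heavy-ball momentum vs. vanilla gradient descent
# on a strongly convex quadratic f(w) = 0.5 * w^T A w - b^T w.
import numpy as np

rng = np.random.default_rng(0)
d = 50
mu, L = 1.0, 100.0                      # strong convexity and smoothness; kappa = L / mu
eigs = np.linspace(mu, L, d)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(eigs) @ Q.T             # symmetric matrix with prescribed spectrum
b = rng.standard_normal(d)
w_star = np.linalg.solve(A, b)          # unique minimizer

def grad(w):
    return A @ w - b

T = 300

# Vanilla gradient descent with the standard step size 2 / (L + mu).
w_gd = np.zeros(d)
for _ in range(T):
    w_gd = w_gd - (2.0 / (L + mu)) * grad(w_gd)

# Polyak's momentum (heavy ball):
#   w_{t+1} = w_t - eta * grad(w_t) + beta * (w_t - w_{t-1}),
# with eta = 4 / (sqrt(L) + sqrt(mu))^2 and beta = ((sqrt(kappa)-1)/(sqrt(kappa)+1))^2.
eta = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
w_prev = np.zeros(d)
w_hb = np.zeros(d)
for _ in range(T):
    w_next = w_hb - eta * grad(w_hb) + beta * (w_hb - w_prev)
    w_prev, w_hb = w_hb, w_next

print("GD distance to optimum:        ", np.linalg.norm(w_gd - w_star))
print("Heavy-ball distance to optimum:", np.linalg.norm(w_hb - w_star))
```

Under these assumptions, the gap to the optimum shrinks roughly like $(1-Θ(\frac{1}{κ}))^t$ for vanilla gradient descent and $(1-Θ(\frac{1}{\sqrt{κ}}))^t$ for the momentum iterates, which mirrors the contrast between the prior rates and the accelerated rates stated in the abstract.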