Paper Title
Maximal Initial Learning Rates in Deep ReLU Networks
Paper Authors
Paper Abstract
Training a neural network requires choosing a suitable learning rate, which involves a trade-off between speed and effectiveness of convergence. While there has been considerable theoretical and empirical analysis of how large the learning rate can be, most prior work focuses only on late-stage training. In this work, we introduce the maximal initial learning rate $η^{\ast}$: the largest learning rate at which a randomly initialized neural network can successfully begin training and achieve (at least) a given threshold accuracy. Using a simple approach to estimate $η^{\ast}$, we observe that in constant-width fully-connected ReLU networks, $η^{\ast}$ behaves differently from the maximum learning rate later in training. Specifically, we find that $η^{\ast}$ is well predicted as a power of depth $\times$ width, provided that (i) the width of the network is sufficiently large compared to the depth, and (ii) the input layer is trained at a relatively small learning rate. We further analyze the relationship between $η^{\ast}$ and the sharpness $λ_{1}$ of the network at initialization, indicating that they are closely, though not inversely, related. We formally prove bounds for $λ_{1}$ in terms of depth $\times$ width that align with our empirical results.
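
The abstract mentions "a simple approach to estimate $η^{\ast}$" without spelling it out. The sketch below (in PyTorch) shows one plausible such procedure, assuming $η^{\ast}$ is bracketed by a log-space bisection over candidate learning rates, where a candidate counts as successful if a freshly initialized constant-width fully-connected ReLU network reaches a threshold training accuracy within a fixed step budget. The network sizes, synthetic data, step budget, threshold, and the monotonicity assumption behind the bisection are all illustrative choices rather than the paper's protocol; the condition that the input layer be trained at a relatively small learning rate is also omitted here.

```python
# Hedged sketch: one way to estimate the maximal initial learning rate eta*.
# All hyperparameters and the bisection-based success criterion are assumptions,
# not the paper's exact method.
import math
import torch
import torch.nn as nn

def make_mlp(depth, width, in_dim, out_dim):
    """Constant-width fully-connected ReLU network with `depth` linear layers."""
    layers = [nn.Linear(in_dim, width), nn.ReLU()]
    for _ in range(depth - 2):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, out_dim))
    return nn.Sequential(*layers)

def trains_successfully(lr, depth, width, data, targets,
                        steps=300, threshold=0.9, seed=0):
    """Train a freshly initialized network at learning rate `lr` with full-batch SGD;
    return True if training accuracy reaches `threshold` within the step budget."""
    torch.manual_seed(seed)
    net = make_mlp(depth, width, data.shape[1], int(targets.max()) + 1)
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(net(data), targets)
        if not torch.isfinite(loss):      # training has diverged
            return False
        loss.backward()
        opt.step()
    acc = (net(data).argmax(dim=1) == targets).float().mean().item()
    return acc >= threshold

def estimate_eta_star(depth, width, data, targets, lo=1e-4, hi=10.0, iters=12):
    """Log-space bisection for the largest learning rate that still trains successfully.
    Assumes `lo` succeeds, `hi` fails, and success is monotone in the learning rate."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # geometric midpoint
        if trains_successfully(mid, depth, width, data, targets):
            lo = mid                      # still succeeds: eta* >= mid
        else:
            hi = mid                      # fails: eta* < mid
    return lo

if __name__ == "__main__":
    # Synthetic, easily learnable stand-in data (the paper's experiments use real datasets).
    X = torch.randn(1024, 64)
    y = (X[:, 0] > 0).long()
    print("estimated eta*:", estimate_eta_star(depth=5, width=256, data=X, targets=y))
```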
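
The sharpness $λ_{1}$ referred to above is, in common usage, the largest eigenvalue of the training-loss Hessian; the abstract does not restate the definition, so that reading is an assumption here. A short sketch of estimating $λ_{1}$ at initialization by power iteration on Hessian-vector products:

```python
# Hedged sketch: sharpness lambda_1 at initialization, assumed to be the top
# eigenvalue of the training-loss Hessian, estimated by power iteration on
# Hessian-vector products. Power iteration returns the eigenvalue of largest
# magnitude, which at initialization is typically the largest positive one.
import torch
import torch.nn as nn

def sharpness_at_init(net, data, targets, iters=50):
    """Top Hessian eigenvalue of the loss at the current (initial) parameters."""
    params = [p for p in net.parameters() if p.requires_grad]
    loss = nn.CrossEntropyLoss()(net(data), targets)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit starting direction, shaped like the parameter list.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
    v = [vi / norm for vi in v]

    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        eig = sum((h * vi).sum() for h, vi in zip(hv, v)).item()   # Rayleigh quotient
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    return eig

# Example (hypothetical names from the sketch above):
# net = make_mlp(depth=5, width=256, in_dim=64, out_dim=2)
# print("lambda_1 at init:", sharpness_at_init(net, X, y))
```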