Paper Title

On the Global Convergence of Training Deep Linear ResNets

Authors

Difan Zou, Philip M. Long, Quanquan Gu

Abstract

We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) for training $L$-hidden-layer linear residual networks (ResNets). We prove that for training deep residual networks with certain linear transformations at input and output layers, which are fixed throughout training, both GD and SGD with zero initialization on all hidden weights can converge to the global minimum of the training loss. Moreover, when specializing to appropriate Gaussian random linear transformations, GD and SGD provably optimize wide enough deep linear ResNets. Compared with the global convergence result of GD for training standard deep linear networks (Du & Hu, 2019), our condition on the neural network width is sharper by a factor of $O(\kappa L)$, where $\kappa$ denotes the condition number of the covariance matrix of the training data. We further propose modified identity input and output transformations, and show that a $(d+k)$-wide neural network is sufficient to guarantee the global convergence of GD/SGD, where $d,k$ are the input and output dimensions respectively.
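To make the training setup concrete, the following is a minimal sketch of the architecture described in the abstract: a deep linear ResNet $f(x) = B(I + W_L)\cdots(I + W_1)Ax$ with fixed Gaussian input/output transformations $A, B$, all hidden residual weights initialized to zero, and plain gradient descent on the squared loss. This is not the authors' code; the dimensions (`d`, `k`, `m`, `L`), the $1/\sqrt{m}$ scaling, the toy dataset `X, Y`, and the step size `eta` are illustrative assumptions, not the paper's theoretical choices.

```python
import jax
import jax.numpy as jnp

# Hypothetical dimensions: d = input dim, k = output dim, m = hidden width, L = depth.
d, k, m, L = 5, 3, 16, 8
key = jax.random.PRNGKey(0)
kA, kB, kX, kY = jax.random.split(key, 4)

# Fixed (non-trainable) Gaussian input/output transformations; scaling is illustrative.
A = jax.random.normal(kA, (m, d)) / jnp.sqrt(m)   # input transformation
B = jax.random.normal(kB, (k, m)) / jnp.sqrt(m)   # output transformation

# All hidden residual weights start at zero, as in the paper's initialization scheme.
Ws = jnp.zeros((L, m, m))

# Toy training data (placeholder for a fixed training set X, Y).
X = jax.random.normal(kX, (d, 32))
Y = jax.random.normal(kY, (k, 32))

def forward(Ws, X):
    # f(X) = B (I + W_L) ... (I + W_1) A X : linear ResNet with identity skip connections.
    H = A @ X
    for l in range(L):
        H = H + Ws[l] @ H
    return B @ H

def loss(Ws):
    # Squared loss over the training set.
    return 0.5 * jnp.sum((forward(Ws, X) - Y) ** 2)

grad_loss = jax.jit(jax.grad(loss))

eta = 1e-2  # illustrative step size
for step in range(200):
    Ws = Ws - eta * grad_loss(Ws)  # gradient descent on the hidden weights only
    if step % 50 == 0:
        print(step, float(loss(Ws)))
```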
