Paper Title
Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent
Paper Authors
Paper Abstract
We prove that two-layer (Leaky)ReLU networks initialized by, e.g., the widely used method proposed by He et al. (2015) and trained using gradient descent on a least-squares loss are not universally consistent. Specifically, we describe a large class of one-dimensional data-generating distributions for which, with high probability, gradient descent only finds a bad local minimum of the optimization landscape, since it is unable to move the biases far away from their initialization at zero. It turns out that in these cases, the network found essentially performs linear regression even if the target function is non-linear. We further provide numerical evidence that this happens in practical situations for some multi-dimensional distributions, and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior.
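To make the described setup concrete, here is a minimal sketch of the training configuration the abstract refers to: a two-layer ReLU network with He-style weight initialization and zero biases, trained with full-batch gradient descent on a least-squares loss over a one-dimensional nonlinear target. The framework (PyTorch), the quadratic target, and all hyperparameters are illustrative assumptions, not specifics taken from the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 1-D inputs with a nonlinear target (illustrative choice, not from the paper).
x = torch.linspace(-1.0, 1.0, 256).unsqueeze(1)
y = x ** 2

# Two-layer ReLU network.
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))

# He et al. (2015) initialization for the weights; biases start at zero,
# matching the initialization scheme the abstract refers to.
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
        nn.init.zeros_(layer.bias)

loss_fn = nn.MSELoss()
# Full-batch SGD with no momentum is plain gradient descent.
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(5000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# If the biases barely move from zero, the learned function is close to
# a linear fit of the data, the failure mode described in the abstract.
print("final loss:", loss.item())
print("max |bias| in first layer:", model[0].bias.abs().max().item())
```

Inspecting the first-layer biases after training is the natural diagnostic here: when they stay near their zero initialization, every ReLU kink sits at the origin and the network can only realize (approximately) linear functions on the data's support.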