Paper Title
Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions
Paper Authors
Paper Abstract
We study the dynamics of optimization and the generalization properties of one-hidden layer neural networks with quadratic activation function in the over-parametrized regime where the layer width $m$ is larger than the input dimension $d$. We consider a teacher-student scenario where the teacher has the same structure as the student with a hidden layer of smaller width $m^*\le m$. We describe how the empirical loss landscape is affected by the number $n$ of data samples and the width $m^*$ of the teacher network. In particular we determine how the probability that there be no spurious minima on the empirical loss depends on $n$, $d$, and $m^*$, thereby establishing conditions under which the neural network can in principle recover the teacher. We also show that under the same conditions gradient descent dynamics on the empirical loss converges and leads to small generalization error, i.e. it enables recovery in practice. Finally we characterize the time-convergence rate of gradient descent in the limit of a large number of samples. These results are confirmed by numerical experiments.
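As an illustration of the setup the abstract describes, below is a minimal NumPy sketch of the teacher-student experiment, assuming the student computes $f(x)=\sum_j (w_j\cdot x)^2$ (quadratic activation with unit second-layer weights), standard Gaussian inputs, and full-batch gradient descent on the empirical squared loss. The dimensions, step size, and iteration count are illustrative choices, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): input dimension d, teacher width
# m_star, over-parametrized student width m > d, and n training samples.
d, m_star, m, n = 10, 3, 20, 2000


def forward(W, X):
    """One-hidden-layer net with quadratic activation: f(x) = sum_j (w_j . x)^2."""
    return np.sum((X @ W.T) ** 2, axis=1)


# Teacher-student data: Gaussian inputs, labels produced by the narrower teacher.
W_teacher = rng.standard_normal((m_star, d)) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = forward(W_teacher, X)

# Full-batch gradient descent on the empirical squared loss,
# starting from a randomly initialized over-parametrized student.
W = rng.standard_normal((m, d)) / np.sqrt(d)
lr, steps = 1e-3, 20000
for t in range(steps):
    Z = X @ W.T                                    # (n, m), Z[i, j] = w_j . x_i
    resid = np.sum(Z ** 2, axis=1) - y             # per-sample prediction error
    grad = (2.0 / n) * (resid[:, None] * Z).T @ X  # gradient of (1/2n) * sum_i resid_i^2
    W -= lr * grad
    if t % 5000 == 0:
        print(f"step {t:6d}   empirical loss {0.5 * np.mean(resid ** 2):.3e}")

# Generalization error estimated on fresh samples from the same input distribution.
X_test = rng.standard_normal((5000, d))
gen_err = 0.5 * np.mean((forward(W, X_test) - forward(W_teacher, X_test)) ** 2)
print("estimated generalization error:", gen_err)
```

Here the over-parametrized regime of the abstract corresponds to $m > d > m^*$, and comparing the final empirical loss with the error on fresh test samples gives a rough estimate of the generalization gap discussed in the paper.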