Paper Title

Finite Versus Infinite Neural Networks: an Empirical Study

Paper Authors

Jaehoon Lee, Samuel S. Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, Jascha Sohl-Dickstein

Paper Abstract

We perform a careful, thorough, and large-scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite-width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; centered and ensembled finite networks have reduced posterior variance and behave more similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence between finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite-width networks; diagonal regularization of kernels acts similarly to early stopping; floating-point precision limits kernel performance beyond a critical dataset size; regularized ZCA whitening improves accuracy; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena; and equivariance of CNNs is only beneficial for narrow networks far from the kernel regime. Our experiments additionally motivate an improved layer-wise scaling for weight decay, which improves generalization in finite-width networks. Finally, we develop improved best practices for using NNGP and NT kernels for prediction, including a novel ensembling technique. Using these best practices, we achieve state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class we consider.
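
Several of the findings above map onto short code sketches. As a minimal, hedged example of the NNGP and NT kernel predictions the abstract discusses, the snippet below uses the authors' neural-tangents library; the architecture, data shapes, and diag_reg value are illustrative assumptions, not the paper's exact settings. The diag_reg argument is the diagonal regularization of the kernel that the abstract reports acts similarly to early stopping.

```python
from jax import random
import neural_tangents as nt
from neural_tangents import stax

# Infinite-width fully-connected network; kernel_fn computes both the
# NNGP and NT kernels in closed form.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(10),
)

key = random.PRNGKey(0)
x_train = random.normal(key, (64, 3072))  # stand-in for flattened CIFAR-10
y_train = random.normal(key, (64, 10))    # stand-in for one-hot labels
x_test = random.normal(key, (16, 3072))

# Exact kernel regression / infinite-time gradient-descent predictions.
# diag_reg adds a ridge term on the kernel diagonal.
predict_fn = nt.predict.gradient_descent_mse_ensemble(
    kernel_fn, x_train, y_train, diag_reg=1e-4)

y_nngp = predict_fn(x_test=x_test, get='nngp')  # Bayesian NNGP posterior mean
y_ntk = predict_fn(x_test=x_test, get='ntk')    # NTK / trained-network limit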
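```

Likewise, the regularized ZCA whitening that the abstract reports improves kernel accuracy admits a short NumPy sketch. This assumes one common regularization form (shrinking the covariance toward a multiple of the identity before the inverse square root); the paper's exact parameterization and tuned strength may differ.

```python
import numpy as np

def regularized_zca(x_train, x_test, reg=0.1):
    """Regularized ZCA whitening (assumed form): the data covariance is
    shrunk toward a scaled identity before the inverse square root.
    `reg` is an illustrative strength, not the paper's tuned value."""
    mean = x_train.mean(axis=0)
    xc = x_train - mean
    cov = xc.T @ xc / xc.shape[0]
    d = cov.shape[0]
    # Blend the covariance with a multiple of the identity whose scale
    # matches the average per-feature variance.
    cov_reg = (1 - reg) * cov + reg * (np.trace(cov) / d) * np.eye(d)
    eigvals, eigvecs = np.linalg.eigh(cov_reg)
    w = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T  # ZCA whitening matrix
    # Whiten test data using statistics computed on the training set.
    return xc @ w, (x_test - mean) @ w
```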
