Paper title
Geometric compression of invariant manifolds in neural nets
Paper authors
Paper abstract
We study how neural networks compress uninformative input space in models where data lie in $d$ dimensions, but whose labels vary only within a linear manifold of dimension $d_\parallel < d$. We show that for a one-hidden-layer network initialized with infinitesimal weights (i.e. in the feature learning regime) and trained with gradient descent, the first layer of weights evolves to become nearly insensitive to the $d_\perp = d - d_\parallel$ uninformative directions. These are effectively compressed by a factor $\lambda \sim \sqrt{p}$, where $p$ is the size of the training set. We quantify the benefit of such a compression on the test error $\epsilon$. For large initialization of the weights (the lazy training regime), no compression occurs, and for regular boundaries separating labels we find that $\epsilon \sim p^{-\beta}$, with $\beta_\text{Lazy} = d / (3d-2)$. Compression improves the learning curves so that $\beta_\text{Feature} = (2d-1)/(3d-2)$ if $d_\parallel = 1$ and $\beta_\text{Feature} = (d + d_\perp/2)/(3d-2)$ if $d_\parallel > 1$. We test these predictions for a stripe model where boundaries are parallel interfaces ($d_\parallel = 1$) as well as for a cylindrical boundary ($d_\parallel = 2$). Next, we show that compression shapes the evolution of the Neural Tangent Kernel (NTK) in time, so that its top eigenvectors become more informative and display a larger projection on the labels. Consequently, kernel learning with the frozen NTK at the end of training outperforms the initial NTK. We confirm these predictions both for a one-hidden-layer FC network trained on the stripe model and for a 16-layer CNN trained on MNIST, for which we also find $\beta_\text{Feature} > \beta_\text{Lazy}$.
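As a concrete reading of the scaling predictions quoted above, the short Python sketch below evaluates the exponents $\beta_\text{Lazy}$ and $\beta_\text{Feature}$ for a few ambient dimensions and draws a toy sample from a stripe-like dataset. The function names, the Gaussian input distribution, and the single-interface labeling are illustrative assumptions, not the paper's exact experimental setup (the stripe model in the abstract uses parallel interfaces).

```python
import numpy as np

def predicted_exponents(d, d_parallel):
    """Learning-curve exponents for epsilon ~ p^{-beta} as quoted in the abstract.

    d          : ambient input dimension
    d_parallel : dimension of the linear manifold along which labels vary
    """
    d_perp = d - d_parallel
    beta_lazy = d / (3 * d - 2)
    if d_parallel == 1:
        beta_feature = (2 * d - 1) / (3 * d - 2)
    else:
        beta_feature = (d + d_perp / 2) / (3 * d - 2)
    return beta_lazy, beta_feature

def stripe_like_sample(p, d, seed=None):
    """Toy stripe-like data (assumption: a single interface at x_1 = 0):
    Gaussian inputs in d dimensions whose label depends only on the first
    coordinate, so d_parallel = 1 and the remaining d - 1 directions are
    uninformative."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((p, d))
    y = np.sign(x[:, 0])  # label varies only along the informative direction
    return x, y

if __name__ == "__main__":
    for d in (2, 5, 10):
        bl, bf = predicted_exponents(d, d_parallel=1)
        print(f"d={d:2d}  beta_lazy={bl:.3f}  beta_feature={bf:.3f}")
    x, y = stripe_like_sample(p=1000, d=5, seed=0)
    print("stripe-like sample:", x.shape, y.shape)
```

For $d_\parallel = 1$, both exponents approach $2/3$ and $1/3$ respectively only in the limits stated by the formulas above; e.g. for $d = 2$ the sketch prints $\beta_\text{Lazy} = 0.5$ and $\beta_\text{Feature} = 0.75$, illustrating the gap that compression is predicted to open.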