Paper title
Geometric compression of invariant manifolds in neural nets
Paper authors
Paper abstract
We study how neural networks compress uninformative input space in models where data lie in $d$ dimensions, but whose labels vary only within a linear manifold of dimension $d_\parallel < d$. We show that for a one-hidden-layer network initialized with infinitesimal weights (i.e. in the feature learning regime) and trained with gradient descent, the first layer of weights evolves to become nearly insensitive to the $d_\perp = d - d_\parallel$ uninformative directions. These are effectively compressed by a factor $\lambda \sim \sqrt{p}$, where $p$ is the size of the training set. We quantify the benefit of such a compression on the test error $\epsilon$. For large initialization of the weights (the lazy training regime), no compression occurs, and for regular boundaries separating labels we find that $\epsilon \sim p^{-\beta}$, with $\beta_\text{Lazy} = d / (3d-2)$. Compression improves the learning curves so that $\beta_\text{Feature} = (2d-1)/(3d-2)$ if $d_\parallel = 1$ and $\beta_\text{Feature} = (d + d_\perp/2)/(3d-2)$ if $d_\parallel > 1$. We test these predictions for a stripe model where boundaries are parallel interfaces ($d_\parallel = 1$) as well as for a cylindrical boundary ($d_\parallel = 2$). Next, we show that compression shapes the evolution of the Neural Tangent Kernel (NTK) in time, so that its top eigenvectors become more informative and display a larger projection on the labels. Consequently, kernel learning with the frozen NTK at the end of training outperforms the initial NTK. We confirm these predictions both for a one-hidden-layer FC network trained on the stripe model and for a 16-layer CNN trained on MNIST, for which we also find $\beta_\text{Feature} > \beta_\text{Lazy}$.
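As a concrete reading of the scaling predictions quoted above, the short Python sketch below evaluates the exponents $\beta_\text{Lazy}$ and $\beta_\text{Feature}$ for a few ambient dimensions and draws a toy sample from a stripe-like dataset. The function names, the Gaussian input distribution, and the single-interface labeling are illustrative assumptions, not the paper's exact experimental setup (the stripe model in the abstract uses parallel interfaces).

```python
import numpy as np

def predicted_exponents(d, d_parallel):
    """Learning-curve exponents for epsilon ~ p^{-beta} as quoted in the abstract.

    d          : ambient input dimension
    d_parallel : dimension of the linear manifold along which labels vary
    """
    d_perp = d - d_parallel
    beta_lazy = d / (3 * d - 2)
    if d_parallel == 1:
        beta_feature = (2 * d - 1) / (3 * d - 2)
    else:
        beta_feature = (d + d_perp / 2) / (3 * d - 2)
    return beta_lazy, beta_feature

def stripe_like_sample(p, d, seed=None):
    """Toy stripe-like data (assumption: a single interface at x_1 = 0):
    Gaussian inputs in d dimensions whose label depends only on the first
    coordinate, so d_parallel = 1 and the remaining d - 1 directions are
    uninformative."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((p, d))
    y = np.sign(x[:, 0])  # label varies only along the informative direction
    return x, y

if __name__ == "__main__":
    for d in (2, 5, 10):
        bl, bf = predicted_exponents(d, d_parallel=1)
        print(f"d={d:2d}  beta_lazy={bl:.3f}  beta_feature={bf:.3f}")
    x, y = stripe_like_sample(p=1000, d=5, seed=0)
    print("stripe-like sample:", x.shape, y.shape)
```

For $d_\parallel = 1$, both exponents approach $2/3$ and $1/3$ respectively only in the limits stated by the formulas above; e.g. for $d = 2$ the sketch prints $\beta_\text{Lazy} = 0.5$ and $\beta_\text{Feature} = 0.75$, illustrating the gap that compression is predicted to open.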