Paper Title

The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks

Paper Authors

Emmanuel Abbe, Enric Boix-Adsera, Theodor Misiakiewicz

Paper Abstract

It is currently known how to characterize functions that neural networks can learn with SGD for two extremal parameterizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parametrization of interest (non-linear but regular networks) no tight characterization has yet been achieved, despite significant developments. We take a step in this direction by considering depth-2 neural networks trained by SGD in the mean-field regime. We consider functions on binary inputs that depend on a latent low-dimensional subspace (i.e., small number of coordinates). This regime is of interest since it is poorly understood how neural networks routinely tackle high-dimensional datasets and adapt to latent low-dimensional structure without suffering from the curse of dimensionality. Accordingly, we study SGD-learnability with $O(d)$ sample complexity in a large ambient dimension $d$. Our main results characterize a hierarchical property, the "merged-staircase property", that is both necessary and nearly sufficient for learning in this setting. We further show that non-linear training is necessary: for this class of functions, linear methods on any feature map (e.g., the NTK) are not capable of learning efficiently. The key tools are a new "dimension-free" dynamics approximation result that applies to functions defined on a latent space of low dimension, a proof of global convergence based on polynomial identity testing, and an improvement of lower bounds against linear methods for non-almost orthogonal functions.
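
For concreteness, the merged-staircase property can be stated informally as follows (a paraphrase for this page, not the paper's exact wording). Write the target function in the Fourier-Walsh basis on $k$ latent coordinates,

$$
f_*(x) = \sum_{j=1}^{m} \hat{f}(S_j) \prod_{i \in S_j} x_i, \qquad S_j \subseteq [k],
$$

and say that $f_*$ satisfies the merged-staircase property if the support sets can be ordered so that each $S_j$ contains at most one coordinate not already in $S_1 \cup \dots \cup S_{j-1}$. For example, $x_1 + x_1 x_2 + x_1 x_2 x_3$ satisfies the property (each monomial introduces one new coordinate), while the isolated parity $x_1 x_2 x_3$ does not (its single support set introduces three new coordinates at once).

This combinatorial condition is easy to test mechanically. Below is a minimal illustrative sketch (not code from the paper; the function name and the set-of-supports representation are assumptions for this example):

```python
def has_merged_staircase_property(supports):
    """Check whether the Fourier support sets can be ordered so that
    each set introduces at most one coordinate not seen before.
    Greedy is safe here: placing any currently-feasible set only grows
    the union of seen coordinates, so it can never block a later set."""
    remaining = [frozenset(s) for s in supports]
    seen = set()
    while remaining:
        # pick any set that adds at most one new coordinate
        pick = next((s for s in remaining if len(s - seen) <= 1), None)
        if pick is None:
            return False  # every remaining set needs >= 2 new coordinates
        seen |= pick
        remaining.remove(pick)
    return True

# x1 + x1*x2 + x1*x2*x3 satisfies the property:
print(has_merged_staircase_property([{1}, {1, 2}, {1, 2, 3}]))  # True
# the 3-parity x1*x2*x3 alone does not:
print(has_merged_staircase_property([{1, 2, 3}]))               # False
```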
