Paper Title


Is Stochastic Gradient Descent Near Optimal?

Authors

Yifan Zhu, Hong Jun Jeon, Benjamin Van Roy

Abstract


The success of neural networks over the past decade has established them as effective models for many relevant data generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, Jeon & Van Roy (arXiv:2203.00246) demonstrate that, when data is generated by a ReLU teacher network with $W$ parameters, an optimal learner needs only $\tilde{O}(W/ε)$ samples to attain expected error $ε$. However, existing computational theory suggests that, even for single-hidden-layer teacher networks, to attain small error for all such teacher networks, the computation required to achieve this sample complexity is intractable. In this work, we fit single-hidden-layer neural networks to data generated by single-hidden-layer ReLU teacher networks with parameters drawn from a natural distribution. We demonstrate that stochastic gradient descent (SGD) with automated width selection attains small expected error with a number of samples and total number of queries both nearly linear in the input dimension and width. This suggests that SGD nearly achieves the information-theoretic sample complexity bounds of Jeon & Van Roy (arXiv:2203.00246) in a computationally efficient manner. An important difference between our positive empirical results and the negative theoretical results is that the latter address worst-case error of deterministic algorithms, while our analysis centers on expected error of a stochastic algorithm.
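The sketch below illustrates the teacher-student setup the abstract describes: a single-hidden-layer ReLU teacher with randomly drawn parameters generates data, and a student of the same architecture is fit by online SGD on squared error. This is not the authors' code; the standard-Gaussian parameter and input distributions, the noiseless targets, the fixed student width (the paper uses automated width selection), and the step size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, width = 16, 8  # input dimension and teacher width (kept small for the demo)

# Teacher parameters drawn from a standard Gaussian (an assumed stand-in
# for the paper's "natural distribution").
W_t = rng.normal(size=(width, d)) / np.sqrt(d)
v_t = rng.normal(size=width) / np.sqrt(width)

def teacher(x):
    """Single-hidden-layer ReLU network: f(x) = v . relu(W x)."""
    return v_t @ np.maximum(W_t @ x, 0.0)

# Student of the same architecture. Width is fixed here, whereas the paper
# selects it automatically.
W_s = rng.normal(size=(width, d)) / np.sqrt(d)
v_s = rng.normal(size=width) / np.sqrt(width)

lr, n_steps = 0.05, 20000
for _ in range(n_steps):
    x = rng.normal(size=d)            # one fresh sample per step (online SGD)
    h = np.maximum(W_s @ x, 0.0)      # student hidden activations
    err = v_s @ h - teacher(x)        # residual on this sample
    # Gradients of 0.5 * err^2 with respect to the student parameters.
    grad_v = err * h
    grad_W = err * np.outer(v_s * (h > 0.0), x)
    v_s -= lr * grad_v
    W_s -= lr * grad_W

# Monte Carlo estimate of the expected squared error after training.
xs = rng.normal(size=(2000, d))
preds = np.maximum(xs @ W_s.T, 0.0) @ v_s
targets = np.maximum(xs @ W_t.T, 0.0) @ v_t
print(f"estimated expected squared error: {np.mean((preds - targets) ** 2):.4f}")
```

Because each SGD step consumes a fresh sample and makes one forward/backward pass, the total sample and query counts in this sketch grow linearly with the number of steps, mirroring the near-linear scaling in input dimension and width that the abstract reports.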
