Paper Title

Online stochastic gradient descent on non-convex losses from high-dimensional inference

Paper Authors

Gerard Ben Arous, Reza Gheissari, Aukosh Jagannath

Paper Abstract

Stochastic gradient descent (SGD) is a popular algorithm for optimization problems arising in high-dimensional inference tasks. Here one produces an estimator of an unknown parameter from independent samples of data by iteratively optimizing a loss function. This loss function is random and often non-convex. We study the performance of the simplest version of SGD, namely online SGD, from a random start in the setting where the parameter space is high-dimensional. We develop nearly sharp thresholds for the number of samples needed for consistent estimation as one varies the dimension. Our thresholds depend only on an intrinsic property of the population loss which we call the information exponent. In particular, our results do not assume uniform control on the loss itself, such as convexity or uniform derivative bounds. The thresholds we obtain are polynomial in the dimension and the precise exponent depends explicitly on the information exponent. As a consequence of our results, we find that except for the simplest tasks, almost all of the data is used simply in the initial search phase to obtain non-trivial correlation with the ground truth. Upon attaining non-trivial correlation, the descent is rapid and exhibits law of large numbers type behavior. We illustrate our approach by applying it to a wide set of inference tasks such as phase retrieval, and parameter estimation for generalized linear models, online PCA, and spiked tensor models, as well as to supervised learning for single-layer networks with general activation functions.
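
To make the online setting described above concrete, here is a minimal illustrative sketch (not the authors' code) of online SGD from a random start for one of the tasks listed, phase retrieval. Each fresh sample is used for exactly one gradient step, and the iterate is projected back to the unit sphere. The dimension `n`, step size `delta`, and sample count are illustrative choices, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 500                       # dimension of the parameter space
n_samples = 200_000           # one fresh sample per SGD step (online setting)
delta = 0.01 / np.sqrt(n)     # illustrative step size, not from the paper

# Ground-truth parameter on the unit sphere.
v = rng.standard_normal(n)
v /= np.linalg.norm(v)

# Random start: uniform on the sphere, so the initial
# correlation with v is only O(1/sqrt(n)).
x = rng.standard_normal(n)
x /= np.linalg.norm(x)

for k in range(n_samples):
    a = rng.standard_normal(n)        # fresh data sample, used once
    y = (a @ v) ** 2                  # noiseless phase-retrieval response
    # Per-sample loss: l(x; a, y) = (y - (a @ x)^2)^2 / 2
    r = (a @ x) ** 2 - y
    grad = 2.0 * r * (a @ x) * a      # gradient of the per-sample loss in x
    x = x - delta * grad
    x /= np.linalg.norm(x)            # project back to the unit sphere

    if (k + 1) % 20_000 == 0:
        print(f"step {k + 1}: correlation |<x, v>| = {abs(x @ v):.3f}")
```

In a run like this, the correlation with the ground truth typically stays near its initial O(1/sqrt(n)) value for a long stretch before growing, matching the abstract's picture of a long search phase followed by a rapid descent; the sign ambiguity of phase retrieval is why the absolute correlation is tracked.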
