Paper Title

Spectral Bias and Task-Model Alignment Explain Generalization in Kernel Regression and Infinitely Wide Neural Networks

Paper Authors

Abdulkadir Canatar, Blake Bordelon, Cengiz Pehlevan

Paper Abstract

Generalization beyond a training dataset is a main goal of machine learning, but theoretical understanding of generalization remains an open problem for many models. The need for a new theory is exacerbated by recent observations in deep neural networks where overparameterization leads to better performance, contradicting the conventional wisdom from classical statistics. In this paper, we investigate generalization error for kernel regression, which, besides being a popular machine learning method, also includes infinitely overparameterized neural networks trained with gradient descent. We use techniques from statistical mechanics to derive an analytical expression for generalization error applicable to any kernel or data distribution. We present applications of our theory to real and synthetic datasets, and for many kernels including those that arise from training deep neural networks in the infinite-width limit. We elucidate an inductive bias of kernel regression to explain data with "simple functions", which are identified by solving a kernel eigenfunction problem on the data distribution. This notion of simplicity allows us to characterize whether a kernel is compatible with a learning task, facilitating good generalization performance from a small number of training examples. We show that more data may impair generalization when noisy or not expressible by the kernel, leading to non-monotonic learning curves with possibly many peaks. To further understand these phenomena, we turn to the broad class of rotation invariant kernels, which is relevant to training deep neural networks in the infinite-width limit, and present a detailed mathematical analysis of them when data is drawn from a spherically symmetric distribution and the number of input dimensions is large.
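
As a rough illustration of the setting described in the abstract (not code from the paper), the sketch below fits kernel ridge regression to a synthetic 1-D task with an RBF kernel and then eigendecomposes the empirical kernel matrix to check how much of the target's power lies in the top kernel modes, a crude proxy for the task-model alignment the authors formalize. The kernel choice, target function, ridge value, and all names (`rbf_kernel`, `kernel_regression`) are illustrative assumptions.

```python
# Minimal sketch, assuming an RBF kernel and a synthetic 1-D target.
import numpy as np

def rbf_kernel(X, Z, length_scale=0.5):
    """Gaussian (RBF) kernel matrix between rows of X and Z."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-d2 / (2 * length_scale**2))

def kernel_regression(X_train, y_train, X_test, ridge=1e-8):
    """Kernel (ridge) regression predictor evaluated on X_test."""
    K = rbf_kernel(X_train, X_train)
    alpha = np.linalg.solve(K + ridge * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_test, X_train) @ alpha

rng = np.random.default_rng(0)

# Synthetic task: low-frequency plus high-frequency component on [-1, 1].
def target(x):
    return np.sin(2 * np.pi * x[:, 0]) + 0.3 * np.sin(12 * np.pi * x[:, 0])

X_train = rng.uniform(-1, 1, size=(200, 1))
y_train = target(X_train)
X_test = rng.uniform(-1, 1, size=(1000, 1))
y_pred = kernel_regression(X_train, y_train, X_test)
print(f"test MSE: {np.mean((y_pred - target(X_test))**2):.4f}")

# Empirical spectral view: eigendecompose K/n on a large sample of the input
# distribution and project the target onto the eigenvectors. Fast spectral decay
# with target power concentrated in the top modes suggests good alignment.
X = rng.uniform(-1, 1, size=(1000, 1))
K = rbf_kernel(X, X) / len(X)
eigvals, eigvecs = np.linalg.eigh(K)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
weights = (eigvecs.T @ target(X))**2                  # target power per kernel mode
print("top-10 eigenvalues:", np.round(eigvals[:10], 4))
print("fraction of target power in top 10 modes:",
      weights[:10].sum() / weights.sum())
```

In this toy setup the smooth low-frequency part of the target is captured by the leading kernel modes, while the high-frequency term falls on small-eigenvalue modes, mirroring the spectral bias the paper describes.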
