Paper Title

Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit

Paper Authors

Ben Adlam, Jaehoon Lee, Lechao Xiao, Jeffrey Pennington, Jasper Snoek

Paper Abstract

Modern deep learning models have achieved great success in predictive accuracy for many data modalities. However, their application to many real-world tasks is restricted by poor uncertainty estimates, such as overconfidence on out-of-distribution (OOD) data and ungraceful failures under distributional shift. Previous benchmarks have found that ensembles of neural networks (NNs) are typically the best calibrated models on OOD data. Inspired by this, we leverage recent theoretical advances that characterize the function-space prior of an ensemble of infinitely-wide NNs as a Gaussian process, termed the neural network Gaussian process (NNGP). We use the NNGP with a softmax link function to build a probabilistic model for multi-class classification and marginalize over the latent Gaussian outputs to sample from the posterior. This gives us a better understanding of the implicit prior NNs place on function space and allows a direct comparison of the calibration of the NNGP and its finite-width analogue. We also examine the calibration of previous approaches to classification with the NNGP, which treat classification problems as regression to the one-hot labels. In this case the Bayesian posterior is exact, and we compare several heuristics to generate a categorical distribution over classes. We find these methods are well calibrated under distributional shift. Finally, we consider an infinite-width final layer in conjunction with a pre-trained embedding. This replicates the important practical use case of transfer learning and allows scaling to significantly larger datasets. As well as achieving competitive predictive accuracy, this approach is better calibrated than its finite-width analogue.
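
To make the abstract's "regression to one-hot labels" route concrete, below is a minimal sketch (not the paper's code) of NNGP classification: it builds the analytic NNGP kernel of a one-hidden-layer ReLU network, computes the exact Gaussian-process posterior over the latent outputs given one-hot training labels, and then applies one possible heuristic, Monte Carlo sampling of the latent outputs followed by argmax counting, to turn that Gaussian posterior into a categorical distribution over classes. The kernel choice, the hyperparameters (sigma_w, sigma_b, diag_reg, n_samples), and the sampling heuristic are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch of NNGP classification as exact GP regression to one-hot labels.
# Assumptions (not from the paper): one-hidden-layer ReLU arc-cosine kernel,
# illustrative hyperparameters, and a Monte Carlo argmax heuristic for turning
# the Gaussian posterior over latent outputs into class probabilities.
import numpy as np

def nngp_relu_kernel(X1, X2, sigma_w=1.0, sigma_b=0.1):
    """NNGP kernel of a one-hidden-layer ReLU network (arc-cosine kernel)."""
    d = X1.shape[1]
    # Kernel of the linear input layer.
    K12 = sigma_w**2 * (X1 @ X2.T) / d + sigma_b**2
    K11 = sigma_w**2 * np.sum(X1**2, axis=1) / d + sigma_b**2
    K22 = sigma_w**2 * np.sum(X2**2, axis=1) / d + sigma_b**2
    norms = np.sqrt(np.outer(K11, K22))
    cos_t = np.clip(K12 / norms, -1.0, 1.0)
    theta = np.arccos(cos_t)
    # E[relu(u) relu(v)] for (u, v) jointly Gaussian under the input-layer kernel.
    ev = norms * (np.sin(theta) + (np.pi - theta) * cos_t) / (2 * np.pi)
    return sigma_w**2 * ev + sigma_b**2

def gp_posterior(K_tt, K_xt, K_xx_diag, Y_onehot, diag_reg=1e-4):
    """Exact GP regression to one-hot labels: posterior mean and marginal variance."""
    n = K_tt.shape[0]
    L = np.linalg.cholesky(K_tt + diag_reg * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y_onehot))  # (n, C)
    mean = K_xt @ alpha                                         # (m, C)
    v = np.linalg.solve(L, K_xt.T)                              # (n, m)
    var = K_xx_diag - np.sum(v**2, axis=0)                      # (m,), shared across classes
    return mean, var

def categorical_from_gaussian(mean, var, n_samples=1000, seed=0):
    """One heuristic: sample latent outputs and count argmax frequencies per class."""
    rng = np.random.default_rng(seed)
    m, C = mean.shape
    noise = rng.standard_normal((n_samples, m, C))
    samples = mean[None] + np.sqrt(np.maximum(var, 0.0))[None, :, None] * noise
    preds = samples.argmax(-1)                                  # (n_samples, m)
    counts = np.stack([np.bincount(preds[:, j], minlength=C) for j in range(m)])
    return counts / n_samples                                   # (m, C) class probabilities

# Toy usage with random data: 100 train / 20 test points, 10 classes.
rng = np.random.default_rng(0)
X_train, X_test = rng.standard_normal((100, 32)), rng.standard_normal((20, 32))
Y_onehot = np.eye(10)[rng.integers(0, 10, size=100)]
K_tt = nngp_relu_kernel(X_train, X_train)
K_xt = nngp_relu_kernel(X_test, X_train)
K_xx_diag = np.diag(nngp_relu_kernel(X_test, X_test))
mean, var = gp_posterior(K_tt, K_xt, K_xx_diag, Y_onehot)
probs = categorical_from_gaussian(mean, var)                    # shape (20, 10)
```

Swapping `categorical_from_gaussian` for, e.g., a softmax of the posterior mean would give another plausible heuristic of the kind the abstract says the paper compares; the softmax-link NNGP model described earlier instead marginalizes over the latent Gaussian outputs by sampling, since that posterior is no longer available in closed form.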
