Paper Title
Train Like a (Var)Pro: Efficient Training of Neural Networks with Variable Projection
Paper Authors
Paper Abstract
Deep neural networks (DNNs) have achieved state-of-the-art performance across a variety of traditional machine learning tasks, e.g., speech recognition, image classification, and segmentation. The ability of DNNs to efficiently approximate high-dimensional functions has also motivated their use in scientific applications, e.g., to solve partial differential equations (PDEs) and to generate surrogate models. In this paper, we consider the supervised training of DNNs, which arises in many of the above applications. We focus on the central problem of optimizing the weights of the given DNN such that it accurately approximates the relation between observed input and target data. Devising effective solvers for this optimization problem is notoriously challenging due to the large number of weights, non-convexity, data-sparsity, and non-trivial choice of hyperparameters. To solve the optimization problem more efficiently, we propose the use of variable projection (VarPro), a method originally designed for separable nonlinear least-squares problems. Our main contribution is the Gauss-Newton VarPro method (GNvpro) that extends the reach of the VarPro idea to non-quadratic objective functions, most notably, cross-entropy loss functions arising in classification. These extensions make GNvpro applicable to all training problems that involve a DNN whose last layer is an affine mapping, which is common in many state-of-the-art architectures. In our four numerical experiments from surrogate modeling, segmentation, and classification, GNvpro solves the optimization problem more efficiently than commonly-used stochastic gradient descent (SGD) schemes. Also, GNvpro finds solutions that generalize well to unseen data points, and in all but one example better than well-tuned SGD methods.
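The core VarPro idea referenced in the abstract can be illustrated with a small sketch: for a model whose last layer is affine, f(x) = W·phi(x; theta) + b, and a least-squares loss, the linear weights (W, b) can be eliminated in closed form, leaving a reduced objective in the nonlinear weights theta only. The sketch below is an assumption-laden illustration of that projection step, not the paper's GNvpro method (which uses Gauss-Newton steps and extends to cross-entropy losses); the feature map `phi`, its dimensions, and the synthetic data are made up for demonstration.

```python
# Minimal sketch of variable projection (VarPro) for a separable model
# f(x) = W * phi(x; theta) + b with a least-squares loss.
# Hypothetical example; not the GNvpro implementation from the paper.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # synthetic inputs
Y = np.sin(X @ rng.normal(size=(3, 2)))          # synthetic targets (2 outputs)

def phi(X, theta):
    """Nonlinear feature map: one hidden layer with tanh activation (assumed)."""
    W1 = theta[:3 * 8].reshape(3, 8)
    b1 = theta[3 * 8:]
    return np.tanh(X @ W1 + b1)                  # (n_samples, 8) features

def reduced_loss(theta):
    """Project out the affine last layer: solve for it in closed form,
    then evaluate the loss as a function of the nonlinear weights only."""
    Z = phi(X, theta)
    Z1 = np.hstack([Z, np.ones((Z.shape[0], 1))])    # append bias column
    W, *_ = np.linalg.lstsq(Z1, Y, rcond=None)       # optimal affine layer
    residual = Z1 @ W - Y
    return 0.5 * np.sum(residual ** 2)

# Optimize only the nonlinear weights; the linear weights are recomputed
# exactly at every evaluation of the reduced objective.
theta0 = rng.normal(scale=0.1, size=3 * 8 + 8)
result = minimize(reduced_loss, theta0, method="BFGS")
print("reduced objective:", result.fun)
```

In this toy setting the outer optimizer only ever sees the reduced objective in theta; handling non-quadratic losses such as cross-entropy, as GNvpro does, requires an inner solve rather than a single least-squares step.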