Paper Title

Towards a theory of machine learning

Paper Authors

Vanchurin, Vitaly

Paper Abstract

We define a neural network as a septuple consisting of (1) a state vector, (2) an input projection, (3) an output projection, (4) a weight matrix, (5) a bias vector, (6) an activation map and (7) a loss function. We argue that the loss function can be imposed either on the boundary (i.e. input and/or output neurons) or in the bulk (i.e. hidden neurons) for both supervised and unsupervised systems. We apply the principle of maximum entropy to derive a canonical ensemble of the state vectors subject to a constraint imposed on the bulk loss function by a Lagrange multiplier (or an inverse temperature parameter). We show that in an equilibrium the canonical partition function must be a product of two factors: a function of the temperature and a function of the bias vector and weight matrix. Consequently, the total Shannon entropy consists of two terms which represent respectively a thermodynamic entropy and a complexity of the neural network. We derive the first and second laws of learning: during learning the total entropy must decrease until the system reaches an equilibrium (i.e. the second law), and the increment in the loss function must be proportional to the increment in the thermodynamic entropy plus the increment in the complexity (i.e. the first law). We calculate the entropy destruction to show that the efficiency of learning is given by the Laplacian of the total free energy which is to be maximized in an optimal neural architecture, and explain why the optimization condition is better satisfied in a deep network with a large number of hidden layers. The key properties of the model are verified numerically by training a supervised feedforward neural network using the method of stochastic gradient descent. We also discuss a possibility that the entire universe on its most fundamental level is a neural network.
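
To make the statistical-mechanics construction summarized above easier to follow, the relations named in the abstract can be written schematically as below. This is only a sketch in notation of our own choosing: H denotes the bulk loss function of the state vector x (with weight matrix w and bias vector b), beta the inverse temperature (Lagrange multiplier), S the Shannon entropy and C the complexity; the precise definitions and proportionality constants are those of the paper, not reproduced here.

\[
p(\mathbf{x}) \;=\; \frac{e^{-\beta H(\mathbf{x};\,\hat{w},\mathbf{b})}}{Z(\beta,\hat{w},\mathbf{b})},
\qquad
Z(\beta,\hat{w},\mathbf{b}) \;=\; \int d^{N}x\; e^{-\beta H(\mathbf{x};\,\hat{w},\mathbf{b})} .
\]

In equilibrium the partition function is claimed to factorize, which splits the total Shannon entropy into a thermodynamic part and a complexity:

\[
Z(\beta,\hat{w},\mathbf{b}) \;=\; \mathcal{A}(\beta)\,\mathcal{B}(\hat{w},\mathbf{b})
\quad\Longrightarrow\quad
S_{\mathrm{total}} \;=\; S_{\mathrm{thermo}}(\beta) \;+\; C(\hat{w},\mathbf{b}) .
\]

The two laws of learning then read, schematically,

\[
\frac{dS_{\mathrm{total}}}{dt} \;\le\; 0
\qquad \text{(second law: total entropy decreases until equilibrium)},
\]
\[
d\langle H \rangle \;\propto\; dS_{\mathrm{thermo}} \;+\; dC
\qquad \text{(first law: loss increment tracks thermodynamic entropy plus complexity)}.
\]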
