Paper title
Topology of deep neural networks
Paper authors
Paper abstract
We study how the topology of a data set $M = M_a \cup M_b \subseteq \mathbb{R}^d$, representing two classes $a$ and $b$ in a binary classification problem, changes as it passes through the layers of a well-trained neural network, i.e., one with perfect accuracy on the training set and near-zero generalization error ($\approx 0.01\%$). The goal is to shed light on two mysteries in deep neural networks: (i) a nonsmooth activation function like ReLU outperforms a smooth one like hyperbolic tangent; (ii) successful neural network architectures rely on having many layers, even though a shallow network can approximate any function arbitrarily well. We performed extensive experiments on the persistent homology of a wide range of point cloud data sets, both real and simulated. The results consistently demonstrate the following: (1) Neural networks operate by changing topology, transforming a topologically complicated data set into a topologically simple one as it passes through the layers. No matter how complicated the topology of the $M$ we begin with, when passed through a well-trained neural network $f : \mathbb{R}^d \to \mathbb{R}^p$, there is a vast reduction in the Betti numbers of both components $M_a$ and $M_b$; in fact, they nearly always reduce to their lowest possible values: $\beta_k\bigl(f(M_i)\bigr) = 0$ for $k \ge 1$ and $\beta_0\bigl(f(M_i)\bigr) = 1$, $i = a, b$. Furthermore, (2) the reduction in Betti numbers is significantly faster for ReLU activation than for hyperbolic tangent activation, as the former defines nonhomeomorphic maps that change topology, whereas the latter defines homeomorphic maps that preserve topology. Lastly, (3) shallow and deep networks transform data sets differently: a shallow network operates mainly through changing geometry and changes topology only in its final layers, while a deep one spreads topological changes more evenly across all layers.
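The experiment the abstract describes, tracking the Betti numbers of each class as the data passes through the layers of a trained ReLU network, can be sketched in a few lines. The sketch below is not the authors' code; the library choices (PyTorch for the network, ripser for persistent homology), the toy two-class data set, and the network widths are assumptions made purely for illustration.

```python
# Minimal sketch: estimate Betti numbers of one class after each layer of a
# small, trained ReLU network. Libraries and toy data are illustrative choices.
import numpy as np
import torch
import torch.nn as nn
from ripser import ripser


def betti_numbers(points, maxdim=1, persistence_thresh=0.5):
    """Estimate beta_0..beta_maxdim of a point cloud by counting persistence
    intervals longer than `persistence_thresh` in the Vietoris-Rips diagrams."""
    dgms = ripser(points, maxdim=maxdim)["dgms"]
    return [int(np.sum((d[:, 1] - d[:, 0]) > persistence_thresh)) for d in dgms]


# Toy binary classification data: class a is a noisy circle (beta_0 = 1,
# beta_1 = 1) and class b is a larger surrounding circle, so the two classes
# are topologically entangled in R^2.
theta = np.random.rand(400) * 2 * np.pi
M_a = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * np.random.randn(400, 2)
M_b = 2.5 * np.c_[np.cos(theta), np.sin(theta)] + 0.05 * np.random.randn(400, 2)
X = torch.tensor(np.r_[M_a, M_b], dtype=torch.float32)
y = torch.tensor([0] * 400 + [1] * 400)

# A small deep ReLU network f: R^2 -> R^2, trained to (near-)perfect accuracy.
widths = [2, 16, 16, 16, 2]
layers = []
for i in range(len(widths) - 1):
    layers += [nn.Linear(widths[i], widths[i + 1]), nn.ReLU()]
net = nn.Sequential(*layers[:-1])  # no ReLU after the final linear layer
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(net(X), y)
    loss.backward()
    opt.step()

# Track Betti numbers of class a layer by layer; the abstract's claim is that
# they collapse to beta_0 = 1 and beta_k = 0 (k >= 1) by the output layer.
h = X[y == 0]
print("input  :", betti_numbers(h.numpy()))
for j, layer in enumerate(net):
    h = layer(h)
    print(f"layer {j}:", betti_numbers(h.detach().numpy()))
```

Swapping `nn.ReLU()` for `nn.Tanh()`, or varying the depth in `widths`, gives a rough way to probe claims (2) and (3): how quickly the Betti numbers fall and how the reduction is distributed across layers.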