Paper Title
Inherent Noise in Gradient Based Methods
Paper Authors
Paper Abstract
Previous work has examined the ability of larger-capacity neural networks to generalize better than smaller ones, even without explicit regularizers, by analyzing gradient-based algorithms such as GD and SGD. The presence of noise and its effect on robustness to parameter perturbations has been linked to generalization. We examine a property of GD and SGD: instead of iterating through all scalar weights in the network and updating them one by one, GD (and SGD) updates all the parameters at the same time. As a result, each parameter $w^i$ computes its partial derivative at the stale iterate $\mathbf{w_t}$, but then suffers the loss $\hat{L}(\mathbf{w_{t+1}})$. We show that this introduces noise into the optimization, and that this noise penalizes models that are sensitive to perturbations in the weights. The penalty is most pronounced on the batch currently being used for the update, and is larger for larger models.
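The sketch below is an illustrative aside (not code from the paper) that makes the abstract's central observation concrete: under GD every coordinate's partial derivative is evaluated at the stale iterate $\mathbf{w_t}$, yet the loss is incurred at the jointly updated $\mathbf{w_{t+1}}$. It contrasts the simultaneous update with a hypothetical coordinate-by-coordinate update on a toy quadratic objective; the objective, learning rate, and problem size are assumptions made purely for illustration.

```python
import numpy as np

def loss(w, A, b):
    # Toy quadratic objective L(w) = 0.5 * ||A w - b||^2 (illustrative only).
    r = A @ w - b
    return 0.5 * float(r @ r)

def grad(w, A, b):
    # Full gradient of the quadratic objective.
    return A.T @ (A @ w - b)

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))
b = rng.normal(size=8)
w_t = rng.normal(size=4)
lr = 0.1

# GD-style simultaneous update: every coordinate uses the gradient
# evaluated at the stale iterate w_t.
g_stale = grad(w_t, A, b)
w_simultaneous = w_t - lr * g_stale

# Hypothetical sequential variant: update one coordinate at a time,
# recomputing its partial derivative at the partially updated iterate.
w_sequential = w_t.copy()
for i in range(len(w_sequential)):
    w_sequential[i] -= lr * grad(w_sequential, A, b)[i]

print("loss at stale w_t:           ", loss(w_t, A, b))
print("loss after simultaneous step:", loss(w_simultaneous, A, b))
print("loss after sequential step:  ", loss(w_sequential, A, b))
```

The gap between the two resulting losses is one way to see the "noise" the abstract refers to: each coordinate's stale gradient does not account for the simultaneous movement of all the other coordinates, and the discrepancy grows with how sensitive the loss is to joint weight perturbations.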