Paper Title
Feature Purification: How Adversarial Training Performs Robust Deep Learning
Paper Authors
Paper Abstract
Despite the empirical success of using Adversarial Training to defend deep learning models against adversarial perturbations, it remains rather unclear what principles lie behind the existence of adversarial perturbations and what adversarial training does to the neural network to remove them. In this paper, we present a principle that we call Feature Purification, where we show that one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network; and, more importantly, that one of the goals of adversarial training is to remove such mixtures and thereby purify the hidden weights. We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle. Technically, we give, to the best of our knowledge, the first result proving that the following two statements can hold simultaneously for training a neural network with ReLU activation: (1) training over the original data is indeed non-robust to small adversarial perturbations of some radius, and (2) adversarial training, even with an empirical perturbation algorithm such as FGM, can in fact be provably robust against ANY perturbations of the same radius. Finally, we also prove a complexity lower bound, showing that low-complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel of this network CANNOT defend against perturbations of this same radius, no matter what algorithms are used to train them.
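To make the training procedure described in the abstract concrete, below is a minimal, self-contained sketch (not the authors' code) of adversarial training with the Fast Gradient Method (FGM) for a two-layer ReLU network, the setting analyzed by the theoretical result. The synthetic data, dimensions, perturbation radius `eps`, and optimizer settings are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of FGM adversarial training for a two-layer ReLU network.
# All dimensions, the radius eps, and hyperparameters are assumed for illustration.
import torch
import torch.nn as nn

d, m, n = 100, 512, 2048           # input dim, hidden width, sample count (assumed)
eps = 0.1                          # l2 perturbation radius (assumed)

# Synthetic binary classification data standing in for the paper's distribution.
X = torch.randn(n, d)
y = torch.randint(0, 2, (n,)).float() * 2 - 1   # labels in {-1, +1}

# Two-layer ReLU network with randomly initialized hidden weights.
model = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.SoftMarginLoss()      # logistic loss for +/-1 labels

def fgm_perturb(x, y, radius):
    """One-step FGM: move x by `radius` along the l2-normalized input gradient."""
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x).squeeze(-1), y)
    grad, = torch.autograd.grad(loss, x)
    grad_norm = grad.flatten(1).norm(dim=1, keepdim=True).clamp_min(1e-12)
    return (x + radius * grad / grad_norm).detach()

for step in range(1000):
    idx = torch.randint(0, n, (128,))
    xb, yb = X[idx], y[idx]
    xb_adv = fgm_perturb(xb, yb, eps)                  # empirical perturbation (FGM)
    opt.zero_grad()
    loss = loss_fn(model(xb_adv).squeeze(-1), yb)      # gradient step on perturbed inputs
    loss.backward()
    opt.step()
```

The only difference from clean training is that each batch is replaced by its one-step FGM perturbation before the gradient step; the abstract's claim is that, in the analyzed setting, this empirical procedure already suffices to remove the dense mixtures from the hidden weights and yields provable robustness against any perturbation of the same radius.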