Paper Title
Quantifying the Preferential Direction of the Model Gradient in Adversarial Training With Projected Gradient Descent
Authors
Abstract
Adversarial training, especially with projected gradient descent (PGD), has proven to be a successful approach for improving robustness against adversarial attacks. After adversarial training, the gradients of a model with respect to its inputs have a preferential direction. However, this direction of alignment is not mathematically well established, which makes it difficult to evaluate quantitatively. We propose a novel definition of this direction as the direction of the vector pointing toward the closest point of the support of the closest inaccurate class in decision space. To evaluate the alignment with this direction after adversarial training, we apply a metric that uses generative adversarial networks to produce the smallest residual needed to change the class present in the image. We show that PGD-trained models have a higher alignment than the baseline according to our definition, that our metric presents higher alignment values than a competing metric formulation, and that enforcing this alignment increases the robustness of models.
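The two ingredients named in the abstract, a PGD perturbation and an alignment score between the input gradient and a reference direction, can be illustrated with a toy example. The following is only a minimal sketch under strong assumptions, not the paper's method: it swaps in a two-dimensional logistic-regression classifier with an analytic gradient, takes the reference direction to be the direction toward the decision boundary (available in closed form for a linear model) rather than a GAN-generated residual, and uses cosine similarity as a stand-in alignment score. All function names and parameter values (`input_gradient`, `pgd_attack`, `cosine_alignment`, `eps`, `step`, `n_steps`) are hypothetical.

```python
import numpy as np

def input_gradient(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x (label y in {0, 1})."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return (p - y) * w

def pgd_attack(w, b, x, y, eps=0.1, step=0.02, n_steps=20):
    """L-infinity PGD: take signed gradient ascent steps on the loss,
    projecting back into the eps-ball around x after each step."""
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + step * np.sign(input_gradient(w, b, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection step
    return x_adv

def cosine_alignment(g, d):
    """Cosine similarity between a gradient g and a reference direction d
    (an assumed stand-in for the alignment metric, not the paper's formulation)."""
    return float(g @ d / (np.linalg.norm(g) * np.linalg.norm(d) + 1e-12))

# Toy linear classifier and a single input of class 1 (assumed values).
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([0.5, 0.2])
y = 1

x_adv = pgd_attack(w, b, x, y)

# For a linear model, the closest point of the other class lies along -w when y = 1.
direction = -w
alignment = cosine_alignment(input_gradient(w, b, x, y), direction)
print(x_adv, alignment)
```

In this linear setting the input gradient is always parallel to `w`, so the alignment is trivially close to 1; for a deep network the gradient direction varies with the input, which is what makes a quantitative alignment metric informative.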