Paper Title

Tricking Adversarial Attacks To Fail

Paper Authors

Lindqvist, Blerta

Abstract

Recent adversarial defense approaches have failed. Untargeted gradient-based attacks cause classifiers to choose any wrong class. Our novel white-box defense tricks untargeted attacks into becoming attacks targeted at designated target classes. From these target classes, we can derive the real classes. Our Target Training defense tricks the minimization at the core of untargeted, gradient-based adversarial attacks: minimize the sum of (1) perturbation and (2) classifier adversarial loss. Target Training changes the classifier minimally, and trains it with additional duplicated points (at 0 distance) labeled with designated classes. These differently-labeled duplicated samples minimize both terms (1) and (2) of the minimization, steering attack convergence to samples of designated classes, from which correct classification is derived. Importantly, Target Training eliminates the need to know the attack and the overhead of generating adversarial samples of attacks that minimize perturbations. We obtain an 86.2% accuracy for CW-L2 (confidence=0) in CIFAR10, exceeding even unsecured classifier accuracy on non-adversarial samples. Target Training presents a fundamental change in adversarial defense strategy.
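The minimization that Target Training targets can be illustrated by the standard Carlini-Wagner L2 formulation (a sketch of the widely used objective, not necessarily the paper's exact notation):

```latex
\min_{\delta} \; \|\delta\|_2^2 \; + \; c \cdot f(x + \delta)
\qquad \text{s.t.} \quad x + \delta \in [0,1]^n
```

Here term (1) is the perturbation magnitude $\|\delta\|_2^2$, term (2) is a surrogate adversarial loss $f$ that decreases as the classifier's output moves toward a wrong class, and $c > 0$ trades off the two. A duplicated training point at distance 0 from $x$, labeled with a designated class, lets the attack drive term (1) to zero while term (2) is minimized by converging to that designated class, which is the steering effect the abstract describes.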
