Paper Title
Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions
Paper Authors
Paper Abstract
Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack paradigm provides the adversary with greater control over the target model, thereby exposing, in a wide range of scenarios, threats against deep learning models that cannot be mounted under the conventional attack paradigms. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach on the spoken command classification task and the Tweet emotion classification task, two exemplary machine learning problems in the audio and text domains, respectively. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate, and can even prevent the attacks from being detected by label-shift detection methods.
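To make the attack paradigm concrete, the sketch below illustrates one simple strategy for approximating a desired class distribution: sample a target class for each input from that distribution and run any off-the-shelf targeted attack toward it. This is only a minimal illustration, not necessarily one of the four strategies proposed in the paper; the names `distribution_matching_attack`, `targeted_attack`, and `model` are hypothetical placeholders.

```python
import numpy as np

def distribution_matching_attack(inputs, desired_dist, targeted_attack, model,
                                 rng=None):
    """Illustrative sketch (assumed interfaces, not the paper's algorithm).

    `targeted_attack(x, target, model)` is assumed to return an adversarial
    example aimed at class `target` (e.g. a targeted FGSM/PGD-style attack);
    `desired_dist` is a probability vector over the classes;
    `model(x)` is assumed to return class scores for input `x`.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_classes = len(desired_dist)
    adv_examples, predicted = [], []
    for x in inputs:
        # Draw the target label from the desired class distribution.
        target = rng.choice(num_classes, p=desired_dist)
        # Craft an adversarial example toward that target class.
        x_adv = targeted_attack(x, target, model)
        adv_examples.append(x_adv)
        predicted.append(int(np.argmax(model(x_adv))))
    # Empirical distribution of classes actually induced on the target model.
    empirical = np.bincount(predicted, minlength=num_classes) / len(predicted)
    return adv_examples, empirical
```

Under this sampling strategy, as long as the underlying targeted attack succeeds with high probability (a high fooling rate), the empirical distribution of predicted classes over a large batch of inputs converges toward the desired distribution.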