Paper Title
Backdoor Attack against NLP models with Robustness-Aware Perturbation defense
Paper Authors
Paper Abstract
A backdoor attack aims to embed a hidden backdoor into deep neural networks (DNNs), such that the attacked model performs well on benign samples, whereas its prediction is maliciously changed once the hidden backdoor is activated by an attacker-defined trigger. This threat can arise when the training process is not fully controlled, for example when training on third-party datasets or adopting third-party models. There has been substantial research on defenses against this type of backdoor attack, one of which is the robustness-aware perturbation-based defense. This defense mainly exploits the large robustness gap between poisoned and clean samples. In our work, we break this defense by controlling the robustness gap between poisoned and clean samples with an adversarial training step.
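Below is a minimal PyTorch sketch of what such an adversarial training step could look like. It illustrates only the general idea and is not the authors' implementation: the toy TextClassifier, the adversarial_training_step helper, and hyper-parameters such as epsilon are assumptions made for the example. The step crafts an FGSM-style perturbation in embedding space and fits the model on both the original and the perturbed batch, which lets the attacker tune how robust the model's predictions are on a chosen subset of samples and thereby narrow the robustness gap that perturbation-based defenses rely on.

```python
# Illustrative sketch only: an adversarial training step in embedding space
# that an attacker could run during (poisoned) training to control how robust
# the model's predictions are on a chosen batch of samples.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextClassifier(nn.Module):
    """Toy bag-of-embeddings classifier standing in for a real NLP model."""

    def __init__(self, vocab_size=1000, embed_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids=None, inputs_embeds=None):
        if inputs_embeds is None:
            inputs_embeds = self.embed(token_ids)   # (batch, seq_len, dim)
        pooled = inputs_embeds.mean(dim=1)          # mean-pool over tokens
        return self.fc(pooled)


def adversarial_training_step(model, optimizer, token_ids, labels, epsilon=0.01):
    """One step that trains on the batch and on an FGSM-perturbed copy of it,
    so the attacker can steer the robustness of these samples and shrink the
    poisoned-vs-clean robustness gap exploited by the defense."""
    embeds = model.embed(token_ids).detach().requires_grad_(True)

    # Forward/backward pass to get the loss gradient w.r.t. the embeddings.
    loss = F.cross_entropy(model(inputs_embeds=embeds), labels)
    grad = torch.autograd.grad(loss, embeds)[0]

    # FGSM-style adversarial embeddings; train on both views of the batch.
    adv_embeds = embeds + epsilon * grad.sign()
    optimizer.zero_grad()
    total = F.cross_entropy(model(token_ids=token_ids), labels) + \
            F.cross_entropy(model(inputs_embeds=adv_embeds.detach()), labels)
    total.backward()
    optimizer.step()
    return total.item()


if __name__ == "__main__":
    model = TextClassifier()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    token_ids = torch.randint(0, 1000, (8, 16))   # dummy token batch (assumed)
    labels = torch.zeros(8, dtype=torch.long)     # e.g. the attacker's target label
    print(adversarial_training_step(model, optimizer, token_ids, labels))
```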