Title
FireBERT: Hardening BERT-based classifiers against adversarial attack
Authors
Abstract
We present FireBERT, a set of three proof-of-concept NLP classifiers hardened against TextFooler-style word-perturbation attacks by producing diverse alternatives to the original samples. In one approach, we co-tune BERT on the training data together with synthetic adversarial samples. In a second approach, we generate the synthetic samples at evaluation time through substitution of words and perturbation of embedding vectors, and combine the diversified evaluation results by voting. A third approach replaces evaluation-time word substitution with perturbation of embedding vectors alone. We evaluate FireBERT on the MNLI and IMDB Movie Review datasets, both on original samples and on adversarial examples generated by TextFooler. We also test whether TextFooler is less successful at creating new adversarial samples when attacking FireBERT than when attacking unhardened classifiers. We show that it is possible to improve the accuracy of BERT-based models in the face of adversarial attacks without significantly reducing accuracy on regular benchmark samples. We present co-tuning with a synthetic data generator as a highly effective method that protects against 95% of pre-manufactured adversarial samples while maintaining 98% of original benchmark performance. We also demonstrate evaluation-time perturbation as a promising direction for further research, restoring accuracy up to 75% of benchmark performance on pre-made adversarials, and up to 65% (from a baseline of 75% original / 12% under attack) under active attack by TextFooler.
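The evaluation-time defense described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `classify` stub stands in for a BERT-based classifier head, and the function names, noise scale, and copy count are hypothetical choices made for the example. The idea shown is only the perturb-and-vote mechanism: perturb the input embedding with small random noise several times, classify each copy, and take a majority vote.

```python
import numpy as np

rng = np.random.default_rng(0)

def classify(embedding):
    # Stand-in for a BERT-based classifier head (hypothetical):
    # predicts class 1 if the mean activation is positive, else 0.
    return int(embedding.mean() > 0)

def vote_predict(embedding, n_copies=8, noise_scale=0.05):
    """Evaluation-time hardening sketch: classify several noise-perturbed
    copies of the embedding and combine the results by majority vote."""
    votes = [classify(embedding + rng.normal(0.0, noise_scale, embedding.shape))
             for _ in range(n_copies)]
    votes.append(classify(embedding))  # also count the unperturbed sample
    return int(np.mean(votes) >= 0.5)

emb = rng.normal(0.2, 1.0, size=768)  # toy 768-dimensional "embedding"
print(vote_predict(emb))
```

The intuition is that a word-level adversarial example tends to sit near a decision boundary, so small random perturbations of its embedding often fall back into the correct class region; voting over many perturbed copies then recovers the original label more often than a single forward pass.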