Paper Title
Universal Adversarial Attacks with Natural Triggers for Text Classification
Paper Authors
Paper Abstract
Recent work has demonstrated the vulnerability of modern text classifiers to universal adversarial attacks, which are input-agnostic sequences of words added to text processed by classifiers. Despite being successful, the word sequences produced in such attacks are often ungrammatical and can be easily distinguished from natural text. We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems when added to benign inputs. We leverage an adversarially regularized autoencoder (ARAE) to generate triggers and propose a gradient-based search that aims to maximize the downstream classifier's prediction loss. Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models as per automatic detection metrics and human-subject studies. Our aim is to demonstrate that adversarial attacks can be made harder to detect than previously thought and to enable the development of appropriate defenses.
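To make the described search concrete, below is a minimal, self-contained PyTorch sketch of the core idea in the abstract: gradient ascent on the downstream classifier's prediction loss, taken with respect to the noise vector that a pretrained ARAE generator decodes into a trigger phrase. Everything here is an illustrative stand-in, not the authors' implementation: `generator`, `classifier`, `embedding`, `attack_loss`, and all dimensions are hypothetical placeholders, a real ARAE decoder is autoregressive over discrete words, and the soft word-choice relaxation used below is just one plausible way to keep the search differentiable.

```python
# Hypothetical sketch of gradient-based trigger search in an ARAE latent
# space. In the paper, the generator is a pretrained ARAE and the
# classifier is the (frozen) downstream model under attack; here both
# are random stand-in networks so the script runs on its own.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, EMB, LATENT, TRIG_LEN, N_CLASSES = 1000, 64, 32, 3, 2

# Stand-in ARAE generator: noise vector -> per-position vocabulary logits.
generator = torch.nn.Sequential(
    torch.nn.Linear(LATENT, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, TRIG_LEN * VOCAB),
)
embedding = torch.nn.Embedding(VOCAB, EMB)
# Stand-in classifier: mean-pooled word embeddings -> class logits.
classifier = torch.nn.Sequential(
    torch.nn.Linear(EMB, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, N_CLASSES),
)
# The attack only optimizes the latent noise; all networks stay frozen.
for p in [*generator.parameters(), *classifier.parameters(),
          *embedding.parameters()]:
    p.requires_grad_(False)

def attack_loss(z, input_emb, labels):
    """Classifier's prediction loss on (trigger + benign input)."""
    logits = generator(z).view(TRIG_LEN, VOCAB)
    # Soft word choices keep the search differentiable w.r.t. z; the
    # discrete trigger is recovered by argmax after the search.
    trig_emb = F.softmax(logits, dim=-1) @ embedding.weight
    pooled = torch.cat([trig_emb, input_emb], dim=0).mean(0, keepdim=True)
    return F.cross_entropy(classifier(pooled), labels)

# One benign input (10 random tokens) with its true label.
input_emb = embedding(torch.randint(0, VOCAB, (10,)))
labels = torch.tensor([1])

# Gradient-based search: maximize the prediction loss over z by
# minimizing its negative with a standard optimizer.
z = torch.randn(LATENT, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.1)
for step in range(100):
    opt.zero_grad()
    (-attack_loss(z, input_emb, labels)).backward()
    opt.step()

trigger_ids = generator(z).view(TRIG_LEN, VOCAB).argmax(dim=-1)
print("trigger token ids:", trigger_ids.tolist())
```

Because the trigger is decoded from the ARAE rather than searched token by token, it stays close to the generator's distribution over natural phrases, which is what makes the resulting attacks harder to detect than unconstrained word-level triggers.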