Paper Title
Entangled Watermarks as a Defense against Model Extraction
Paper Authors
Hengrui Jia, Christopher A. Choquette-Choo, Varun Chandrasekaran, Nicolas Papernot
Paper Abstract
Machine learning involves expensive data collection and training procedures. Model owners may be concerned that valuable intellectual property can be leaked if adversaries mount model extraction attacks. As it is difficult to defend against model extraction without sacrificing significant prediction accuracy, watermarking instead leverages unused model capacity to have the model overfit to outlier input-output pairs. Such pairs are watermarks, which are not sampled from the task distribution and are only known to the defender. The defender then demonstrates knowledge of the input-output pairs to claim ownership of the model at inference. The effectiveness of watermarks remains limited because they are distinct from the task distribution and can thus be easily removed through compression or other forms of knowledge transfer. We introduce Entangled Watermarking Embeddings (EWE). Our approach encourages the model to learn features for classifying data that is sampled from the task distribution and data that encodes watermarks. An adversary attempting to remove watermarks that are entangled with legitimate data is also forced to sacrifice performance on legitimate data. Experiments on MNIST, Fashion-MNIST, CIFAR-10, and Speech Commands validate that the defender can claim model ownership with 95% confidence with fewer than 100 queries to the stolen copy, at a modest cost below 0.81 percentage points on average in the defended model's performance.
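The entanglement the abstract describes can be made concrete with a soft-nearest-neighbor-style penalty that measures how mixed together same-class and other-class points are in feature space. The following is a minimal sketch, not the paper's exact training procedure: the choice of loss, the temperature, and the weighting knob `kappa` mentioned in the comments are assumptions not spelled out in this abstract.

```python
import numpy as np

def soft_nearest_neighbor_loss(features, labels, temperature=1.0):
    """Soft nearest neighbor loss over a batch of feature vectors.

    When *minimized*, same-class points cluster together; to entangle
    watermarks with legitimate data in the EWE spirit, one would label
    watermarked inputs with their target class and *maximize* this
    quantity alongside the task loss.
    """
    # Pairwise squared Euclidean distances between all feature vectors.
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    sims = np.exp(-dists / temperature)
    np.fill_diagonal(sims, 0.0)  # exclude self-similarity

    # Mask selecting pairs with matching labels (self excluded).
    same = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)

    eps = 1e-12
    frac = (sims * same).sum(axis=1) / (sims.sum(axis=1) + eps)
    return -np.mean(np.log(frac + eps))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 4))
    labs = np.array([0, 0, 1, 1, 0, 1, 0, 1])
    print(soft_nearest_neighbor_loss(feats, labs, temperature=0.5))
```

A plausible combined objective (again, an assumption for illustration) would be `total_loss = cross_entropy - kappa * snnl` with `kappa > 0`, so that removing the watermark features also degrades the features used for legitimate data.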
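The ownership claim itself can be framed as a hypothesis test: query the suspected stolen model with watermarked inputs and check whether it predicts the watermark target label significantly more often than chance. Below is a minimal sketch assuming a one-sided binomial test; the paper's exact statistical test may differ, and `query_model` is a hypothetical stand-in for API access to the suspect model.

```python
from scipy.stats import binomtest

def claim_ownership(watermarks, target_label, query_model,
                    chance_rate=0.1, confidence=0.95):
    """Return (ownership_claimed, p_value) for a suspect model.

    chance_rate: probability of hitting target_label by chance,
    e.g. 0.1 for a balanced 10-class task such as MNIST or CIFAR-10.
    """
    hits = sum(query_model(x) == target_label for x in watermarks)
    result = binomtest(hits, n=len(watermarks), p=chance_rate,
                       alternative="greater")
    return result.pvalue < (1 - confidence), result.pvalue
```

With fewer than 100 queries on a 10-class task, even a moderate watermark success rate drives the p-value below 0.05, which is consistent with the abstract's claim of ownership at 95% confidence.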