Paper Title

Adversarial Training for High-Stakes Reliability

Paper Authors

Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas

Paper Abstract

In the future, powerful AI systems may be deployed in high-stakes settings, where a single failure could be catastrophic. One technique for improving AI safety in high-stakes settings is adversarial training, which uses an adversary to generate examples to train on in order to achieve better worst-case performance. In this work, we used a safe language generation task ("avoid injuries") as a testbed for achieving high reliability through adversarial training. We created a series of adversarial training techniques -- including a tool that assists human adversaries -- to find and eliminate failures in a classifier that filters text completions suggested by a generator. In our task, we determined that we can set very conservative classifier thresholds without significantly impacting the quality of the filtered outputs. We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance. We hope to see further work in the high-stakes reliability setting, including more powerful tools for enhancing human adversaries and better ways to measure high levels of reliability, until we can confidently rule out the possibility of catastrophic deployment-time failures of powerful models.
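To make the generate-then-filter setup described above concrete, the sketch below shows one plausible way a classifier with a very conservative threshold could gate a generator's proposed completions. The function and parameter names (filtered_generate, injury_score, threshold=0.01, max_attempts) and the threshold value are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, Optional

def filtered_generate(
    prompt: str,
    generate: Callable[[str], str],            # generator: proposes one completion for a prompt
    injury_score: Callable[[str, str], float],  # classifier: estimated probability the completion is injurious
    threshold: float = 0.01,                    # conservative cutoff (illustrative value, not from the paper)
    max_attempts: int = 20,
) -> Optional[str]:
    """Return the first completion the classifier accepts, or None if every attempt is rejected."""
    for _ in range(max_attempts):
        completion = generate(prompt)
        # Conservative filtering: only pass completions the classifier scores
        # as very unlikely to describe injury; everything else is discarded
        # and the generator is asked to try again.
        if injury_score(prompt, completion) < threshold:
            return completion
    return None

```

Because rejected completions are simply resampled, the threshold can be set aggressively low; the abstract's finding is that doing so did not significantly degrade the quality of the outputs that pass the filter.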
