Title
Circumventing interpretability: How to defeat mind-readers
Authors
Abstract
The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values. Yet there is reason to believe that misaligned artificial intelligence will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. In this article, I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these potential future risks.