Paper Title
Path-Specific Objectives for Safer Agent Incentives
Paper Authors
Paper Abstract
We present a general framework for training safe agents whose naive incentives are unsafe. As an example, manipulative or deceptive behaviour can improve rewards but should be avoided. Most approaches fail here: agents maximize expected return by any means necessary. We formally describe settings with 'delicate' parts of the state which should not be used as a means to an end. We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of state, using Causal Influence Diagram analysis. The resulting agents have no incentive to control the delicate state. We further show how our framework unifies and generalizes existing proposals.
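The core idea in the abstract can be illustrated with a toy structural causal model. This is a minimal sketch, not the paper's implementation: the variable names (`delicate_state`, `structural_return`, the baseline action) and the linear functional forms are illustrative assumptions. The action influences the return both directly and through a 'delicate' variable (e.g. a human's beliefs); the path-specific objective evaluates the return with the delicate variable counterfactually held at its value under a baseline action, so the agent gains nothing by controlling it.

```python
# Toy sketch of a path-specific objective (illustrative assumptions, not
# the paper's implementation). Action A affects return R directly and via
# a 'delicate' variable D; the safe objective counts only the effect of A
# on R that is NOT mediated by D.

def delicate_state(action):
    # D depends on the action: the manipulable channel.
    return 2.0 * action

def structural_return(action, delicate):
    # R depends on the action directly and on the delicate state.
    return 3.0 * action + 5.0 * delicate

def naive_objective(action):
    # Standard expected return: rewards manipulating D.
    return structural_return(action, delicate_state(action))

def path_specific_objective(action, baseline_action=0.0):
    # D is counterfactually fixed to its value under the baseline action,
    # removing any incentive to control the delicate state.
    return structural_return(action, delicate_state(baseline_action))

# Acting (a=1) vs. not acting (a=0): the naive objective credits the
# mediated path through D; the path-specific one credits only the
# direct effect.
print(naive_objective(1.0) - naive_objective(0.0))                  # 13.0
print(path_specific_objective(1.0) - path_specific_objective(0.0))  # 3.0
```

The gap between the two differences (13.0 vs. 3.0) is exactly the incentive to manipulate the delicate state, which the path-specific objective removes.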