Paper Title

Models of human preference for learning reward functions

Authors

Knox, W. Bradley; Hatgis-Kessell, Stephane; Booth, Serena; Niekum, Scott; Stone, Peter; Allievi, Alessandro

Abstract

The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments, a type of reinforcement learning from human feedback (RLHF). These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling human preferences instead as informed by each segment's regret, a measure of a segment's deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences, and we prove that the previous partial return model lacks this identifiability property in multiple contexts. We empirically show that our proposed regret preference model outperforms the partial return preference model with finite training data in otherwise the same setting. Additionally, we find that our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research. We have open-sourced our experimental code, the human preferences dataset we gathered, and our training and preference elicitation interfaces for gathering such a dataset.
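To make the contrast between the two preference models concrete, below is a minimal sketch under simplifying assumptions, not the paper's implementation: both models are written as a Bradley-Terry / logistic choice rule with unit temperature, and a segment's regret is reduced to its start state's optimal value minus (partial return + end state's optimal value), with those optimal values (`v_start`, `v_end`) treated as given rather than derived from a learned reward function. All function names are hypothetical.

```python
import numpy as np

def pref_prob_partial_return(rewards_1, rewards_2):
    """Partial-return model: a segment's appeal is the sum of its rewards, and
    preferences follow a logistic (Bradley-Terry-style) choice rule."""
    return 1.0 / (1.0 + np.exp(-(np.sum(rewards_1) - np.sum(rewards_2))))

def segment_regret(rewards, v_start, v_end):
    """Simplified regret of a segment: how far its partial return plus the
    optimal value of its final state falls short of the optimal value of its
    first state. v_start and v_end are assumed to come from V* (given, not learned)."""
    return v_start - (np.sum(rewards) + v_end)

def pref_prob_regret(rewards_1, values_1, rewards_2, values_2):
    """Regret model: the segment that deviates less from optimal
    decision-making (lower regret) is more likely to be preferred."""
    regret_1 = segment_regret(rewards_1, *values_1)
    regret_2 = segment_regret(rewards_2, *values_2)
    return 1.0 / (1.0 + np.exp(-(regret_2 - regret_1)))

# Toy example: two 3-step segments with equal partial return (1.0 each), so the
# partial-return model is indifferent (0.5), while the regret model prefers the
# segment that ends in a higher-value state (zero regret).
seg_a, seg_b = np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0, 0.0])
print(pref_prob_partial_return(seg_a, seg_b))                  # 0.5
print(pref_prob_regret(seg_a, (2.0, 0.0), seg_b, (2.0, 1.0)))  # ~0.27, favors seg_b
```

The last two lines show a case where the models disagree: the partial-return model is indifferent between segments with equal summed reward, while the regret model favors the segment that stays closer to optimal behavior.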
