Paper Title
Group Meritocratic Fairness in Linear Contextual Bandits
Paper Authors
Paper Abstract
We study the linear contextual bandit problem where an agent has to select one candidate from a pool and each candidate belongs to a sensitive group. In this setting, candidates' rewards may not be directly comparable between groups, for example when the agent is an employer hiring candidates from different ethnic groups and some groups have a lower reward due to discriminatory bias and/or social injustice. We propose a notion of fairness that states that the agent's policy is fair when it selects a candidate with the highest relative rank, which measures how good the reward is when compared to candidates from the same group. This is a very strong notion of fairness, since the relative rank is not directly observed by the agent and depends on the underlying reward model and on the distribution of rewards. Thus we study the problem of learning a policy which approximates a fair policy under the condition that the contexts are independent between groups and the distribution of rewards of each group is absolutely continuous. In particular, we design a greedy policy which, at each round, constructs a ridge regression estimate from the observed context-reward pairs and then computes an estimate of the relative rank of each candidate using the empirical cumulative distribution function. We prove that, despite its simplicity and the lack of an initial exploration phase, the greedy policy achieves, up to log factors and with high probability, a fair pseudo-regret of order $\sqrt{dT}$ after $T$ rounds, where $d$ is the dimension of the context vectors. The policy also satisfies demographic parity at each round when averaged over all possible information available before the selection. Finally, we use simulated settings and experiments on US census data to show that our policy also achieves sub-linear fair pseudo-regret in practice.
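The abstract spells out the two computational steps of the greedy policy: a ridge regression estimate of the reward parameter, followed by an empirical-CDF estimate of each candidate's relative rank within its own group. The sketch below illustrates that round structure under stated assumptions; the class name `GreedyRelativeRankPolicy`, the regularization parameter `lam`, the default rank for a group with no history, and the choice to pool every observed candidate's estimated reward into its group's empirical CDF are illustrative choices, not details taken from the paper.

```python
import numpy as np

class GreedyRelativeRankPolicy:
    """Minimal sketch of the greedy policy described in the abstract.

    Each round: (1) fit a ridge regression estimate of the reward
    parameter from all observed context-reward pairs, (2) estimate each
    candidate's relative rank via the empirical CDF of estimated rewards
    within that candidate's group, (3) select the candidate with the
    highest estimated relative rank. No initial exploration phase.
    """

    def __init__(self, d, lam=1.0):
        self.A = lam * np.eye(d)   # regularized Gram matrix (lam is an assumption)
        self.b = np.zeros(d)       # sum of reward-weighted contexts
        self.group_scores = {}     # per-group history of estimated rewards

    def select(self, contexts, groups):
        """contexts: (n, d) array, one row per candidate; groups: length-n labels."""
        theta_hat = np.linalg.solve(self.A, self.b)   # ridge estimate
        scores = contexts @ theta_hat                 # estimated rewards
        ranks = np.empty(len(scores))
        for i, (s, g) in enumerate(zip(scores, groups)):
            past = self.group_scores.get(g, [])
            # Empirical CDF of s within its group's history; with no
            # history the rank is uninformative, so we default it to 1.
            ranks[i] = np.mean([p <= s for p in past]) if past else 1.0
        # Pool this round's estimated rewards into each group's history,
        # so later empirical CDFs draw on all observed contexts (a
        # simplification of the paper's construction).
        for s, g in zip(scores, groups):
            self.group_scores.setdefault(g, []).append(float(s))
        return int(np.argmax(ranks))

    def update(self, context, reward):
        """Record the chosen candidate's context-reward pair."""
        self.A += np.outer(context, context)
        self.b += reward * context
```

A quick synthetic run of the loop, with all quantities made up for illustration:

```python
rng = np.random.default_rng(0)
d, n, theta = 5, 10, np.random.default_rng(1).normal(size=5)
policy = GreedyRelativeRankPolicy(d)
for t in range(1000):
    X = rng.normal(size=(n, d))            # one context per candidate
    g = rng.integers(0, 3, size=n)         # 3 sensitive groups
    i = policy.select(X, g)
    policy.update(X[i], X[i] @ theta + rng.normal())
```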