Adaptive Sampling Strategies to Construct Equitable Training Datasets

Paper Title

Adaptive Sampling Strategies to Construct Equitable Training Datasets

Authors

William Cai, Ro Encarnacion, Bobbie Chern, Sam Corbett-Davies, Miranda Bogen, Stevie Bergman, Sharad Goel

Abstract

In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem. We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups. We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific learning rates and costs per sample. This flexible approach incorporates preferences of model builders and other stakeholders, as well as the statistical properties of the learning task. When data collection decisions are made sequentially, we show that under certain conditions this optimization problem can be efficiently solved even without prior knowledge of the learning rates. To illustrate our approach, we conduct a simulation study of polygenic risk scores on synthetic genomic data, an application domain that often suffers from non-representative data collection. We find that our adaptive sampling strategy outperforms several common data collection heuristics, including equal and proportional sampling, demonstrating the value of strategic dataset design for building equitable models.
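The budget-allocation problem described in the abstract can be illustrated with a small sketch. This is not the authors' method. It assumes, purely for illustration, that each group's expected loss follows a known power-law learning curve `a * (n + 1) ** (-b)`, that the objective is a weighted sum of group losses, and that samples are bought in fixed-size batches. The names `Group`, `loss`, `objective`, `greedy_allocate`, and `equal_allocate`, and all numeric values, are hypothetical. The sketch also does not cover the sequential setting in which learning rates are unknown and must be estimated while collecting data.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    a: float             # scale of the ASSUMED power-law learning curve
    b: float             # decay exponent of the assumed curve (b > 0)
    cost: float          # cost of collecting one sample from this group
    weight: float = 1.0  # hypothetical stakeholder weight on this group's loss


def loss(g: Group, n: int) -> float:
    """Assumed expected loss for group g after n training samples."""
    return g.a * (n + 1) ** (-g.b)


def objective(groups, counts) -> float:
    """Weighted sum of group losses (lower is better)."""
    return sum(g.weight * loss(g, counts[g.name]) for g in groups)


def greedy_allocate(groups, budget: float, batch: int = 10):
    """Repeatedly buy the batch with the largest loss reduction per unit cost."""
    counts = {g.name: 0 for g in groups}
    remaining = budget
    while True:
        best, best_gain = None, 0.0
        for g in groups:
            price = g.cost * batch
            if price > remaining:
                continue  # this group's batch is no longer affordable
            n = counts[g.name]
            gain = g.weight * (loss(g, n) - loss(g, n + batch)) / price
            if gain > best_gain:
                best, best_gain = g, gain
        if best is None:
            return counts  # nothing affordable is left
        counts[best.name] += batch
        remaining -= best.cost * batch


def equal_allocate(groups, budget: float, batch: int = 10):
    """Baseline heuristic: the same number of samples from every group."""
    round_cost = sum(g.cost for g in groups) * batch
    rounds = int(budget // round_cost)
    return {g.name: rounds * batch for g in groups}


# Illustrative values only: group B learns more slowly and costs more per sample.
groups = [
    Group("A", a=1.0, b=0.5, cost=1.0),
    Group("B", a=2.0, b=0.3, cost=2.0),
]
budget = 3000.0
greedy = greedy_allocate(groups, budget)
equal = equal_allocate(groups, budget)
print("greedy:", greedy, "objective:", round(objective(groups, greedy), 4))
print("equal: ", equal, "objective:", round(objective(groups, equal), 4))
```

Under these assumed curves the greedy rule shifts samples toward the group whose loss is still falling fastest per unit of budget. Different weights or a different aggregate (for example, the worst group's loss) would encode different stakeholder preferences.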
