Paper Title
Algorithmic Fairness Datasets: the Story so Far
Paper Authors
Paper Abstract
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being. As a result, a growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations. Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented. Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity). In this work, we target data documentation debt by surveying over two hundred datasets employed in algorithmic fairness research, and producing standardized and searchable documentation for each of them. Moreover, we rigorously identify the three most popular fairness datasets, namely Adult, COMPAS and German Credit, for which we compile in-depth documentation. This unifying documentation effort supports multiple contributions. Firstly, we summarize the merits and limitations of Adult, COMPAS and German Credit, adding to and unifying recent scholarship, calling into question their suitability as general-purpose fairness benchmarks. Secondly, we document and summarize hundreds of available alternatives, annotating their domain and supported fairness tasks, along with additional properties of interest for fairness researchers. Finally, we analyze these datasets from the perspective of five important data curation topics: anonymization, consent, inclusivity, sensitive attributes, and transparency. We discuss different approaches and levels of attention to these topics, making them tangible, and distill them into a set of best practices for the curation of novel resources.