Paper Title


Intrinsic Bias Metrics Do Not Correlate with Application Bias

Paper Authors

Seraphina Goldfarb-Tarrant, Rebecca Marchant, Ricardo Muñoz Sanchez, Mugdha Pandya, Adam Lopez

Paper Abstract


Natural Language Processing (NLP) systems learn harmful societal biases that cause them to amplify inequality as they are deployed in more and more situations. To guide efforts at debiasing these systems, the NLP community relies on a variety of metrics that quantify bias in models. Some of these metrics are intrinsic, measuring bias in word embedding spaces, and some are extrinsic, measuring bias in downstream tasks that the word embeddings enable. Do these intrinsic and extrinsic metrics correlate with each other? We compare intrinsic and extrinsic metrics across hundreds of trained models covering different tasks and experimental conditions. Our results show no reliable correlation between these metrics that holds in all scenarios across tasks and languages. We urge researchers working on debiasing to focus on extrinsic measures of bias, and to make using these measures more feasible via creation of new challenge sets and annotated test data. To aid this effort, we release code, a new intrinsic metric, and an annotated test set focused on gender bias in hate speech.
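The abstract's central question is whether an intrinsic bias score and an extrinsic bias score, collected over many trained models, correlate with each other. Below is a minimal sketch of that kind of check, not the authors' released code; the metric names and values are hypothetical placeholders (e.g., a WEAT-style effect size as the intrinsic score and a downstream performance gap as the extrinsic score).

```python
# Hypothetical illustration: correlate an intrinsic bias score with an
# extrinsic bias score measured on the same set of trained models.
import numpy as np
from scipy import stats

# One entry per trained model / experimental condition (placeholder values).
intrinsic_scores = np.array([0.31, 0.55, 0.12, 0.78, 0.44, 0.60])  # e.g., WEAT-style effect size
extrinsic_scores = np.array([0.05, 0.02, 0.09, 0.03, 0.07, 0.04])  # e.g., downstream performance gap

# Pearson tests linear association; Spearman tests monotonic (rank) association.
pearson_r, pearson_p = stats.pearsonr(intrinsic_scores, extrinsic_scores)
spearman_r, spearman_p = stats.spearmanr(intrinsic_scores, extrinsic_scores)

print(f"Pearson  r = {pearson_r:+.3f} (p = {pearson_p:.3f})")
print(f"Spearman r = {spearman_r:+.3f} (p = {spearman_p:.3f})")
```

A weak or unstable correlation in such a comparison, repeated across tasks and languages, is the kind of evidence the abstract summarizes.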
