Paper Title
Mitigating Gender Bias in Distilled Language Models via Counterfactual Role Reversal
Paper Authors
Paper Abstract
Language models excel at generating coherent text, and model compression techniques such as knowledge distillation have enabled their use in resource-constrained settings. However, these models can be biased in multiple ways, including the unfounded association of male and female genders with gender-neutral professions. Therefore, knowledge distillation without any fairness constraints may preserve or exaggerate the teacher model's biases onto the distilled model. To this end, we present a novel approach to mitigate gender disparity in text generation by learning a fair model during knowledge distillation. We propose two modifications to the base knowledge distillation, both based on counterfactual role reversal: modifying teacher probabilities and augmenting the training set. We evaluate gender polarity across professions in open-ended text generated from the resulting distilled and finetuned GPT-2 models and demonstrate a substantial reduction in gender disparity with only a minor compromise in utility. Finally, we observe that language models that reduce gender polarity in language generation do not improve embedding fairness or downstream classification fairness.
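The abstract describes two counterfactual-role-reversal modifications to standard knowledge distillation: equalizing the teacher's probabilities over gendered continuations, and augmenting the training data with gender-swapped inputs. The sketch below illustrates the general idea in plain Python. The word-pair dictionary, the simple averaging rule, and all function names are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import math

# Hypothetical gendered word pairs used to build counterfactual inputs.
# A real system would use a curated, bidirectional word-pair lexicon.
GENDER_PAIRS = {"he": "she", "she": "he", "his": "her", "her": "his"}

def swap_gender_words(tokens):
    """Data augmentation: build the counterfactual (role-reversed) input
    by swapping gendered words, leaving all other tokens unchanged."""
    return [GENDER_PAIRS.get(t, t) for t in tokens]

def equalize_teacher_probs(p_original, p_counterfactual):
    """Teacher-probability modification: merge the teacher's next-token
    distributions for the original and gender-swapped contexts so that
    gendered continuations receive equal mass. Here we simply average
    the two distributions and renormalize (an assumed merging rule)."""
    vocab = set(p_original) | set(p_counterfactual)
    merged = {t: 0.5 * (p_original.get(t, 0.0) + p_counterfactual.get(t, 0.0))
              for t in vocab}
    z = sum(merged.values())
    return {t: p / z for t, p in merged.items()}

def distillation_loss(student_probs, teacher_probs):
    """Token-level distillation objective: cross-entropy of the student's
    distribution against the (modified) teacher targets."""
    return -sum(p * math.log(student_probs.get(t, 1e-12))
                for t, p in teacher_probs.items())
```

For example, if after the context "The doctor said" the teacher assigns "he" probability 0.6 and "she" 0.2, while the gender-swapped context yields the reverse, `equalize_teacher_probs` produces a target in which both pronouns get 0.4, so the student is not distilled toward the teacher's gendered skew.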