Paper Title
Fair Hate Speech Detection through Evaluation of Social Group Counterfactuals
Paper Authors
Paper Abstract
Approaches for mitigating bias in supervised models are designed to reduce the models' dependence on specific sensitive features of the input data, e.g., mentioned social groups. However, in the case of hate speech detection, it is not always desirable to equalize the effects of social groups because of their essential role in distinguishing outgroup-derogatory hate: particular types of hateful rhetoric carry their intended meaning only when contextualized around certain social group tokens. Counterfactual token fairness for a mentioned social group evaluates whether the model's predictions are the same for (a) the actual sentence and (b) a counterfactual instance, generated by changing the mentioned social group in the sentence. Our approach ensures robust model predictions for counterfactuals that imply a meaning similar to that of the actual sentence. To quantify the similarity of a sentence and its counterfactual, we compare their likelihood scores, calculated by generative language models. By equalizing model behavior on each sentence and its counterfactuals, we mitigate bias in the proposed model while preserving the overall classification performance.
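The abstract describes two concrete steps: generating a counterfactual by swapping the mentioned social group token, and filtering counterfactuals by comparing likelihood scores from a generative language model. Below is a minimal sketch of that idea, not the authors' released code; GPT-2 (via the HuggingFace transformers library), the example sentence, the group-token pair, and the likelihood-gap threshold are all assumptions introduced for illustration.

```python
# Sketch: social-group counterfactual generation and likelihood comparison.
# Assumes GPT-2 as the generative language model; the paper may use a different one.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def log_likelihood(sentence: str) -> float:
    """Average per-token log-likelihood of a sentence under the LM."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Passing labels returns the mean cross-entropy over tokens.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return -loss.item()

def make_counterfactual(sentence: str, original_group: str, new_group: str) -> str:
    """Create a counterfactual instance by swapping the mentioned social group token."""
    return sentence.replace(original_group, new_group)

# Hypothetical example sentence and group-token swap.
sentence = "I saw some women at the park today."
counterfactual = make_counterfactual(sentence, "women", "men")

# Keep only counterfactuals whose likelihood is close to the original's,
# i.e., those that plausibly preserve the sentence's meaning.
# The threshold value here is an illustrative assumption.
gap = abs(log_likelihood(sentence) - log_likelihood(counterfactual))
is_valid_counterfactual = gap < 0.5
print(f"likelihood gap = {gap:.3f}, treat as valid counterfactual: {is_valid_counterfactual}")
```

In this sketch, counterfactuals that pass the likelihood-gap check would be the ones on which the classifier's predictions are equalized with the original sentence; counterfactuals with a large gap (where swapping the group changes the sentence's plausibility or meaning) would be excluded from that constraint.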