Necessity and Sufficiency for Explaining Text Classifiers: A Case Study in Hate Speech Detection

Paper Title

Necessity and Sufficiency for Explaining Text Classifiers: A Case Study in Hate Speech Detection

Authors

Balkir, Esma; Nejadgholi, Isar; Fraser, Kathleen C.; Kiritchenko, Svetlana

Abstract

We present a novel feature attribution method for explaining text classifiers, and analyze it in the context of hate speech detection. Although feature attribution models usually provide a single importance score for each token, we instead provide two complementary and theoretically-grounded scores -- necessity and sufficiency -- resulting in more informative explanations. We propose a transparent method that calculates these values by generating explicit perturbations of the input text, allowing the importance scores themselves to be explainable. We employ our method to explain the predictions of different hate speech detection models on the same set of curated examples from a test suite, and show that different values of necessity and sufficiency for identity terms correspond to different kinds of false positive errors, exposing sources of classifier bias against marginalized groups.
