论文标题

大型语言模型仍将穆斯林与独特的暴力行为联系在一起

Debiased Large Language Models Still Associate Muslims with Uniquely Violent Acts

论文作者

Hemmatian, Babak, Varshney, Lav R.

论文摘要

最近的工作表明,与基督徒和印度教徒相比,在提示穆斯林的提示时,GPT-3模型有偏见,旨在产生暴力文本完成。两次预注册的复制尝试,一次是精确的和一个近似的尝试,在GPT-3的最新指示系列版本中仅发现了最弱的偏差,该版本进行了微调,以消除有偏见和有毒的输出。很少观察到暴力完成。然而,其他预注册的实验表明,在提示中使用与宗教相关的通用名称的暴力完成率显着增加,这也揭示了对穆斯林的二阶偏见。来自非暴力领域的穆斯林名人的名字导致了相对较少的暴力完成,这表明获取个性化信息可以使模型无法使用刻板印象。尽管如此,内容分析揭示了宗教特定的暴力主题,其中包含高度冒犯性思想,无论及时格式如何。我们的结果表明,有必要对大型语言模型进行额外的辩护,以解决高阶模式和关联。

Recent work demonstrates a bias in the GPT-3 model towards generating violent text completions when prompted about Muslims, compared with Christians and Hindus. Two pre-registered replication attempts, one exact and one approximate, found only the weakest bias in the more recent Instruct Series version of GPT-3, fine-tuned to eliminate biased and toxic outputs. Few violent completions were observed. Additional pre-registered experiments, however, showed that using common names associated with the religions in prompts yields a highly significant increase in violent completions, also revealing a stronger second-order bias against Muslims. Names of Muslim celebrities from non-violent domains resulted in relatively fewer violent completions, suggesting that access to individualized information can steer the model away from using stereotypes. Nonetheless, content analysis revealed religion-specific violent themes containing highly offensive ideas regardless of prompt format. Our results show the need for additional debiasing of large language models to address higher-order schemas and associations.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源