大型语言模型仍将穆斯林与独特的暴力行为联系在一起

论文标题

大型语言模型仍将穆斯林与独特的暴力行为联系在一起

Debiased Large Language Models Still Associate Muslims with Uniquely Violent Acts

论文作者

Hemmatian, Babak, Varshney, Lav R.

论文摘要

最近的工作表明，与基督徒和印度教徒相比，在提示穆斯林的提示时，GPT-3模型有偏见，旨在产生暴力文本完成。两次预注册的复制尝试，一次是精确的和一个近似的尝试，在GPT-3的最新指示系列版本中仅发现了最弱的偏差，该版本进行了微调，以消除有偏见和有毒的输出。很少观察到暴力完成。然而，其他预注册的实验表明，在提示中使用与宗教相关的通用名称的暴力完成率显着增加，这也揭示了对穆斯林的二阶偏见。来自非暴力领域的穆斯林名人的名字导致了相对较少的暴力完成，这表明获取个性化信息可以使模型无法使用刻板印象。尽管如此，内容分析揭示了宗教特定的暴力主题，其中包含高度冒犯性思想，无论及时格式如何。我们的结果表明，有必要对大型语言模型进行额外的辩护，以解决高阶模式和关联。

Recent work demonstrates a bias in the GPT-3 model towards generating violent text completions when prompted about Muslims, compared with Christians and Hindus. Two pre-registered replication attempts, one exact and one approximate, found only the weakest bias in the more recent Instruct Series version of GPT-3, fine-tuned to eliminate biased and toxic outputs. Few violent completions were observed. Additional pre-registered experiments, however, showed that using common names associated with the religions in prompts yields a highly significant increase in violent completions, also revealing a stronger second-order bias against Muslims. Names of Muslim celebrities from non-violent domains resulted in relatively fewer violent completions, suggesting that access to individualized information can steer the model away from using stereotypes. Nonetheless, content analysis revealed religion-specific violent themes containing highly offensive ideas regardless of prompt format. Our results show the need for additional debiasing of large language models to address higher-order schemas and associations.

下载PDF全文

下载文献需遵守相关版权规定

论文标题