Paper Title
CausaLM: Causal Model Explanation Through Counterfactual Language Models
Paper Authors
Paper Abstract
Understanding predictions made by deep neural networks is notoriously difficult, but also crucial to their dissemination. Like all machine learning based methods, they are only as good as their training data, and can also capture unwanted biases. While there are tools that can help understand whether such biases exist, they do not distinguish between correlation and causation, and might be ill-suited for text-based models and for reasoning about high-level language concepts. A key problem in estimating the causal effect of a concept of interest on a given model is that this estimation requires the generation of counterfactual examples, which is challenging with existing generation technology. To bridge that gap, we propose CausaLM, a framework for producing causal model explanations using counterfactual language representation models. Our approach is based on fine-tuning deep contextualized embedding models with auxiliary adversarial tasks derived from the causal graph of the problem. Concretely, we show that by carefully choosing auxiliary adversarial pre-training tasks, language representation models such as BERT can effectively learn a counterfactual representation for a given concept of interest and be used to estimate its true causal effect on model performance. A byproduct of our method is a language representation model that is unaffected by the tested concept, which can be useful in mitigating unwanted bias ingrained in the data.
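
The abstract describes the mechanism only at a high level: fine-tune a contextualized encoder such as BERT with an auxiliary adversarial task so that the resulting representation no longer encodes the concept of interest, then reason about model behavior on top of the original versus the concept-blind representation. The snippet below is a minimal sketch of one common way to implement such an adversarial objective, via a gradient-reversal layer; it is an illustration under that assumption, not the authors' released implementation, and the names (CounterfactualBert, GradReverse, task_head, concept_head) are hypothetical.

```python
# Minimal sketch (not the CausaLM release code): adversarial fine-tuning of BERT
# with a gradient-reversal layer, assuming the concept of interest is available
# as a binary label alongside the main task label.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, negated gradient on the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient w.r.t. x is reversed; lambd gets no gradient.
        return -ctx.lambd * grad_output, None


class CounterfactualBert(nn.Module):
    """BERT encoder trained on the main task while an adversarial head tries
    (and, through gradient reversal, is prevented from helping the encoder)
    to predict the concept of interest."""

    def __init__(self, num_task_labels=2, num_concept_labels=2, lambd=1.0):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.task_head = nn.Linear(hidden, num_task_labels)
        self.concept_head = nn.Linear(hidden, num_concept_labels)
        self.lambd = lambd

    def forward(self, input_ids, attention_mask):
        pooled = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).pooler_output
        task_logits = self.task_head(pooled)
        # Reversed gradients push the encoder to discard concept information.
        concept_logits = self.concept_head(GradReverse.apply(pooled, self.lambd))
        return task_logits, concept_logits


if __name__ == "__main__":
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = CounterfactualBert()
    batch = tokenizer(["an example sentence"], return_tensors="pt")
    task_logits, concept_logits = model(batch["input_ids"], batch["attention_mask"])
    loss = (nn.functional.cross_entropy(task_logits, torch.tensor([1]))
            + nn.functional.cross_entropy(concept_logits, torch.tensor([0])))
    loss.backward()  # encoder gradients from the concept head arrive negated
```

With an encoder trained this way, the causal effect the abstract refers to can then be approximated by comparing a downstream classifier's predictions when it is fed the original representation versus the counterfactual, concept-blind one.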