Paper Title
Counterfactual Explanation Based on Gradual Construction for Deep Networks
Paper Authors
Paper Abstract
To understand the black-box characteristics of deep networks, counterfactual explanation, which identifies not only the important features of an input but also how those features should be modified to classify the input as a target class, has gained increasing interest. The patterns that a deep network has learned from a training dataset can be grasped by observing how features vary across classes. However, current approaches modify features to increase the classification probability of the target class irrespective of the internal characteristics of the deep network. This often leads to unclear explanations that deviate from real-world data distributions. To address this problem, we propose a counterfactual explanation method that exploits the statistics learned from a training dataset. In particular, we gradually construct an explanation by iterating over masking and composition steps. The masking step selects an important feature of the input for classification as the target class. The composition step then optimizes the previously selected features so that the network's output lies close to the logit distribution of training data classified as the target class. Experimental results show that our method produces human-friendly interpretations on various classification datasets and verify that such interpretations can be achieved with fewer feature modifications.
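The abstract describes an iterative two-step loop: a masking step that selects the next input feature to modify, and a composition step that optimizes the selected features so the classifier's logits match those of target-class training data. The following is a minimal sketch of that loop, assuming a differentiable PyTorch classifier; the gradient-saliency selection rule and the use of a mean logit vector (`target_logits`) as the matching target are simplifying assumptions for illustration, not the paper's exact criteria.

```python
import torch
import torch.nn.functional as F

def gradual_construction(model, x, target, target_logits,
                         n_steps=5, inner_iters=100, lr=0.1):
    """Sketch of gradual counterfactual construction.

    model         -- trained classifier returning logits, shape (1, C)
    x             -- input tensor, shape (1, d)
    target        -- target class index
    target_logits -- logit statistics (e.g., mean logit vector, shape (1, C))
                     of training samples classified as `target`; an assumed
                     stand-in for the paper's learned logit-space statistics
    """
    mask = torch.zeros_like(x)    # 1 marks features selected for modification
    delta = torch.zeros_like(x)   # accumulated edits to the selected features

    for _ in range(n_steps):
        # Masking step: pick the unselected feature whose input gradient
        # most affects the target logit (a simple saliency heuristic).
        x_in = (x + mask * delta).detach().requires_grad_(True)
        model(x_in)[0, target].backward()
        saliency = x_in.grad.abs() * (1 - mask)   # ignore already-picked features
        mask.view(-1)[saliency.flatten().argmax()] = 1.0

        # Composition step: optimize the selected features so the logits
        # approach the target-class logit statistics.
        d = delta.clone().requires_grad_(True)
        opt = torch.optim.Adam([d], lr=lr)
        for _ in range(inner_iters):
            opt.zero_grad()
            loss = F.mse_loss(model(x + mask * d), target_logits)
            loss.backward()
            opt.step()
        delta = d.detach()

        # Stop once the edited input is classified as the target class.
        if model(x + mask * delta).argmax(dim=1).item() == target:
            break
    return x + mask * delta
```

Because each iteration unmasks only one additional feature and stops as soon as the prediction flips, the loop naturally favors explanations with fewer feature modifications, which is the sparsity property the abstract claims.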