Paper Title
Rethinking the Role of Gradient-Based Attribution Methods for Model Interpretability
Paper Authors
Paper Abstract
Current methods for the interpretability of discriminative deep neural networks commonly rely on the model's input-gradients, i.e., the gradients of the output logits w.r.t. the inputs. The common assumption is that these input-gradients contain information regarding $p_θ ( y \mid x)$, the model's discriminative capabilities, thus justifying their use for interpretability. However, in this work we show that these input-gradients can be arbitrarily manipulated as a consequence of the shift-invariance of softmax without changing the discriminative function. This leaves an open question: if input-gradients can be arbitrary, why are they highly structured and explanatory in standard models? We investigate this by re-interpreting the logits of standard softmax-based classifiers as unnormalized log-densities of the data distribution and show that input-gradients can be viewed as gradients of a class-conditional density model $p_θ(x \mid y)$ implicit within the discriminative model. This leads us to hypothesize that the highly structured and explanatory nature of input-gradients may be due to the alignment of this class-conditional model $p_θ(x \mid y)$ with that of the ground truth data distribution $p_{\text{data}} (x \mid y)$. We test this hypothesis by studying the effect of density alignment on gradient explanations. To achieve this alignment we use score-matching, and propose novel approximations to this algorithm to enable training large-scale models. Our experiments show that improving the alignment of the implicit density model with the data distribution enhances gradient structure and explanatory power while reducing this alignment has the opposite effect. Overall, our finding that input-gradients capture information regarding an implicit generative model implies that we need to re-think their use for interpreting discriminative models.
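To make the abstract's two central observations concrete, here is a minimal sketch in standard softmax notation (the symbols $f_i$, $g$, and $Z_i$ are introduced here for illustration and are not taken from the paper itself). For a classifier with logits $f_i(x)$,

$$p_θ(y = i \mid x) = \frac{\exp f_i(x)}{\sum_j \exp f_j(x)} = \frac{\exp\big(f_i(x) + g(x)\big)}{\sum_j \exp\big(f_j(x) + g(x)\big)}$$

for any smooth function $g(x)$: adding $g$ to every logit leaves the discriminative model $p_θ(y \mid x)$ unchanged while shifting every input-gradient by $\nabla_x g(x)$, so input-gradients can indeed be manipulated arbitrarily. Conversely, reading the logits as unnormalized class-conditional log-densities,

$$p_θ(x \mid y = i) = \frac{\exp f_i(x)}{Z_i(θ)}, \qquad \nabla_x \log p_θ(x \mid y = i) = \nabla_x f_i(x),$$

the input-gradient of logit $i$ is exactly the score of the implicit density model whose alignment with $p_{\text{data}}(x \mid y)$ the paper studies.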
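For reference, the score-matching objective used for this alignment is shown below in its standard (Hyvärinen) form for a generic score $s_θ(x) = \nabla_x \log p_θ(x)$; the paper's large-scale approximations to it are not reproduced here.

$$J(θ) = \mathbb{E}_{x \sim p_{\text{data}}}\Big[\operatorname{tr}\big(\nabla_x s_θ(x)\big) + \tfrac{1}{2}\,\big\|s_θ(x)\big\|_2^2\Big]$$

The trace term is a Hessian trace of the log-density, which is what makes exact score matching expensive for high-dimensional inputs and is the natural target of the approximations the abstract mentions.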