Paper Title
Learning Invariances for Interpretability using Supervised VAE
Authors
Abstract
We propose to learn model invariances as a means of interpreting a model. This is motivated by a reverse engineering principle: if we understand a problem, we may introduce inductive biases into our model in the form of invariances. Conversely, when interpreting a complex supervised model, we can study its invariances to understand how that model solves a problem. To this end, we propose a supervised variant of the variational auto-encoder (VAE). Crucially, only a subset of the dimensions in the latent space contributes to the supervised task, allowing the remaining dimensions to act as nuisance parameters. By sampling only the nuisance dimensions, we are able to generate samples that have undergone transformations that leave the classification unchanged, thereby revealing the invariances of the model. Our experimental results demonstrate the capability of the proposed model both in terms of classification and in the generation of invariantly transformed samples. Finally, we show how combining our model with feature attribution methods makes it possible to reach a more fine-grained understanding of the model's decision process.
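To make the split-latent idea concrete, the following is a minimal sketch of such a supervised VAE, not the authors' implementation: the encoder/decoder architecture, latent sizes, loss weights (beta, gamma), and names such as z_task, z_nuis, and sample_invariances are illustrative assumptions. The classifier is fed only the task block of the latent code, and invariant samples are generated by keeping the task block fixed while redrawing the nuisance block.

```python
# Minimal sketch (assumed details, not the paper's released code) of a supervised VAE
# whose latent space is split into a "task" block, which feeds a classifier, and a
# "nuisance" block, which only feeds the decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SupervisedVAE(nn.Module):
    def __init__(self, x_dim=784, task_dim=8, nuis_dim=8, n_classes=10, hidden=256):
        super().__init__()
        self.task_dim, self.nuis_dim = task_dim, nuis_dim
        z_dim = task_dim + nuis_dim
        self.encoder = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, z_dim)
        self.enc_logvar = nn.Linear(hidden, z_dim)
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim)
        )
        # The classifier sees only the task block, so the nuisance block is free
        # to model factors of variation the label does not depend on.
        self.classifier = nn.Linear(task_dim, n_classes)

    def encode(self, x):
        h = self.encoder(x)
        return self.enc_mu(h), self.enc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        z_task = z[:, : self.task_dim]
        x_logits = self.decoder(z)
        y_logits = self.classifier(z_task)
        return x_logits, y_logits, mu, logvar

    def loss(self, x, y, beta=1.0, gamma=1.0):
        # ELBO (reconstruction + KL) plus a weighted classification term;
        # the weights beta and gamma are placeholders, not values from the paper.
        x_logits, y_logits, mu, logvar = self.forward(x)
        recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum") / x.size(0)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
        clf = F.cross_entropy(y_logits, y)
        return recon + beta * kl + gamma * clf

    @torch.no_grad()
    def sample_invariances(self, x, n_samples=8):
        # Keep the task code of x fixed and redraw only the nuisance block from
        # the prior; the decoded samples should share x's predicted class.
        mu, _ = self.encode(x)
        z_task = mu[:, : self.task_dim]
        outs = []
        for _ in range(n_samples):
            z_nuis = torch.randn(x.size(0), self.nuis_dim, device=x.device)
            z = torch.cat([z_task, z_nuis], dim=1)
            outs.append(torch.sigmoid(self.decoder(z)))
        return torch.stack(outs, dim=1)  # shape: (batch, n_samples, x_dim)
```

Routing only the task block into the classifier is what allows the nuisance block to absorb label-irrelevant factors; decoding a fixed task code with freshly drawn nuisance codes then visualizes transformations to which the classifier is invariant.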