Paper Title

Explaining and Improving Model Behavior with k Nearest Neighbor Representations

Authors

Nazneen Fatema Rajani, Ben Krause, Wenpeng Yin, Tong Niu, Richard Socher, Caiming Xiong

Abstract

Interpretability techniques in NLP have mainly focused on understanding individual predictions using attention visualization or gradient-based saliency maps over tokens. We propose using k nearest neighbor (kNN) representations to identify training examples responsible for a model's predictions and obtain a corpus-level understanding of the model's behavior. Apart from interpretability, we show that kNN representations are effective at uncovering learned spurious associations, identifying mislabeled examples, and improving the fine-tuned model's performance. We focus on Natural Language Inference (NLI) as a case study and experiment with multiple datasets. Our method deploys backoff to kNN for BERT and RoBERTa on examples with low model confidence without any update to the model parameters. Our results indicate that the kNN approach makes the fine-tuned model more robust to adversarial inputs.
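To make the backoff idea concrete, below is a minimal sketch of confidence-based kNN backoff over fixed representations (e.g., final-layer sentence vectors from a fine-tuned encoder), using scikit-learn's NearestNeighbors. This is an illustration under assumptions, not the authors' implementation: the function name `knn_backoff_predict`, the choice of `k=16`, and the confidence threshold `0.6` are hypothetical, and the synthetic arrays stand in for real model representations and softmax outputs.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_backoff_predict(train_reps, train_labels, test_reps, model_probs,
                        k=16, conf_threshold=0.6):
    """Back off to a kNN vote over training representations whenever the
    model's softmax confidence is below `conf_threshold`.

    No model parameters are updated; the kNN index is built once over the
    (frozen) training-set representations. All hyperparameter values here
    are illustrative assumptions, not the paper's settings.
    """
    knn = NearestNeighbors(n_neighbors=k).fit(train_reps)
    model_preds = model_probs.argmax(axis=1)   # model's own predictions
    confidence = model_probs.max(axis=1)       # max softmax probability
    preds = model_preds.copy()

    low_conf = np.where(confidence < conf_threshold)[0]
    if len(low_conf):
        # Retrieve the k nearest training examples for each low-confidence
        # input; these neighbors also serve as the "responsible" training
        # examples for interpretability.
        _, neighbor_idx = knn.kneighbors(test_reps[low_conf])
        for row, idx in zip(low_conf, neighbor_idx):
            # Majority vote over the neighbors' gold labels.
            preds[row] = np.bincount(train_labels[idx]).argmax()
    return preds

# Usage with synthetic stand-ins for encoder representations (3 NLI labels):
rng = np.random.default_rng(0)
train_reps = rng.normal(size=(100, 8)).astype(np.float32)
train_labels = rng.integers(0, 3, size=100)
test_reps = rng.normal(size=(10, 8)).astype(np.float32)
model_probs = rng.dirichlet(np.ones(3), size=10)
print(knn_backoff_predict(train_reps, train_labels, test_reps, model_probs, k=5))
```

The same `kneighbors` call, applied to any single input, returns the training examples nearest to it in representation space, which is the retrieval step the abstract describes for identifying spurious associations and mislabeled training data.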
