Paper Title


Do We Need Another Explainable AI Method? Toward Unifying Post-hoc XAI Evaluation Methods into an Interactive and Multi-dimensional Benchmark

Paper Authors

Mohamed Karim Belaid, Eyke Hüllermeier, Maximilian Rabus, Ralf Krestel

Paper Abstract


In recent years, Explainable AI (xAI) has attracted a lot of attention as various countries turned explanations into a legal right. xAI allows for improving models beyond the accuracy metric by, e.g., debugging the learned pattern and demystifying the AI's behavior. The widespread use of xAI brought new challenges. On the one hand, the number of published xAI algorithms underwent a boom, and it became difficult for practitioners to select the right tool. On the other hand, some experiments highlighted how easily data scientists could misuse xAI algorithms and misinterpret their results. To tackle the issue of comparing and correctly using feature importance xAI algorithms, we propose Compare-xAI, a benchmark that unifies all exclusive functional testing methods applied to xAI algorithms. We propose a selection protocol to shortlist non-redundant functional tests from the literature, i.e., each targeting a specific end-user requirement in explaining a model. The benchmark encapsulates the complexity of evaluating xAI methods into a hierarchical scoring of three levels, namely, targeting three end-user groups: researchers, practitioners, and laymen in xAI. The most detailed level provides one score per test. The second level regroups tests into five categories (fidelity, fragility, stability, simplicity, and stress tests). The last level is the aggregated comprehensibility score, which encapsulates the ease of correctly interpreting the algorithm's output in one easy-to-compare value. Compare-xAI's interactive user interface helps mitigate errors in interpreting xAI results by quickly listing the recommended xAI solutions for each ML task and their current limitations. The benchmark is available at https://karim-53.github.io/cxai/.
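To make the three-level hierarchy described in the abstract concrete, the minimal Python sketch below shows one way per-test scores could roll up into the five category scores and then into a single comprehensibility score. The plain averaging, the test names, and the test-to-category mapping are illustrative assumptions, not Compare-xAI's actual aggregation.

```python
# Illustrative sketch only: how Compare-xAI's three levels of scores *could*
# be aggregated. Category names come from the abstract; the simple averaging
# and the test names below are assumptions, not the benchmark's real formula.
from statistics import mean

# Level 1 (researchers): one score in [0, 1] per functional test.
# Test names and values are hypothetical placeholders.
per_test_scores = {
    "cough_feature_fidelity": 0.9,
    "adversarial_fragility": 0.6,
    "seed_stability": 0.8,
    "sparse_explanation_simplicity": 0.7,
    "high_dimensional_stress": 0.5,
}

# Assumed mapping of each test to one of the five categories named in the abstract.
test_to_category = {
    "cough_feature_fidelity": "fidelity",
    "adversarial_fragility": "fragility",
    "seed_stability": "stability",
    "sparse_explanation_simplicity": "simplicity",
    "high_dimensional_stress": "stress",
}

# Level 2 (practitioners): average the per-test scores within each category.
categories = {"fidelity", "fragility", "stability", "simplicity", "stress"}
category_scores = {
    cat: mean(score for test, score in per_test_scores.items()
              if test_to_category[test] == cat)
    for cat in categories
}

# Level 3 (laymen): a single comprehensibility score, here taken as the mean
# of the category scores (again, an assumed aggregation).
comprehensibility = mean(category_scores.values())

print(category_scores)
print(f"comprehensibility: {comprehensibility:.2f}")
```

Running the sketch prints the five category scores and one aggregate value, mirroring the researcher, practitioner, and layman views described above; the real benchmark exposes these levels through its interactive user interface.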
