Paper Title

Beyond Accuracy: Behavioral Testing of NLP models with CheckList

Authors

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh

Abstract

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
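To make the idea of behavioral testing concrete, below is a minimal, hypothetical sketch of two CheckList-style test types for a sentiment model: a Minimum Functionality Test (MFT) built by expanding a template over a small lexicon, and an invariance test (INV) that checks a label-preserving perturbation. This is not the released CheckList tool's API; the template, lexicon, and the `predict_sentiment` stub are illustrative assumptions.

```python
from itertools import product

# Hypothetical sentiment predictor stub; in practice this would wrap a real model.
def predict_sentiment(text: str) -> int:
    """Return 1 for positive, 0 for negative (placeholder logic)."""
    return 0 if any(w in text for w in ("hate", "dislike", "awful")) else 1

# Minimum Functionality Test (MFT): fill a template with a small lexicon and
# check that every generated case receives the expected label.
template = "I {verb} this {thing}."
lexicon = {
    "verb": ["hate", "dislike"],            # negative verbs -> expected label 0
    "thing": ["movie", "flight", "book"],
}
cases = [
    template.format(verb=v, thing=t)
    for v, t in product(lexicon["verb"], lexicon["thing"])
]
expected = 0
mft_failures = [c for c in cases if predict_sentiment(c) != expected]
print(f"MFT (negative vocabulary): {len(mft_failures)}/{len(cases)} failures")

# Invariance test (INV): a label-preserving perturbation (here, swapping a name)
# should not change the model's prediction.
pairs = [("I loved the food at Joe's.", "I loved the food at Maria's.")]
inv_failures = [p for p in pairs if predict_sentiment(p[0]) != predict_sentiment(p[1])]
print(f"INV (name change): {len(inv_failures)}/{len(pairs)} failures")
```

The released CheckList tool automates this pattern at scale, with template expansion, lexicon and masked-language-model suggestions, and built-in test types (MFT, INV, DIR); the sketch above only mirrors that structure in plain Python.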
