Paper Title

Practical Annotation Strategies for Question Answering Datasets

Paper Authors

Bernhard Kratzwald, Xiang Yue, Huan Sun, Stefan Feuerriegel

Paper Abstract

Annotating datasets for question answering (QA) tasks is very costly, as it requires intensive manual labor and often domain-specific knowledge. Yet strategies for annotating QA datasets in a cost-effective manner are scarce. To provide a remedy for practitioners, our objective is to develop heuristic rules for annotating a subset of questions, so that the annotation cost is reduced while maintaining both in- and out-of-domain performance. For this, we conduct a large-scale analysis in order to derive practical recommendations. First, we demonstrate experimentally that more training samples often contribute only to a higher in-domain test-set performance, but do not help the model generalize to unseen datasets. Second, we develop a model-guided annotation strategy: it makes a recommendation with regard to which subset of samples should be annotated. Its effectiveness is demonstrated in a case study based on domain customization of QA to a clinical setting. Here, remarkably, annotating a stratified subset with only 1.2% of the original training set achieves 97.7% of the performance as if the complete dataset had been annotated. Hence, the labeling effort can be reduced immensely. Altogether, our work fulfills a demand in practice when labeling budgets are limited and recommendations are thus needed for annotating QA datasets more cost-effectively.
