Paper Title

Let's Stop Incorrect Comparisons in End-to-end Relation Extraction!

Authors

Bruno Taillé, Vincent Guigue, Geoffrey Scoutheeten, Patrick Gallinari

Abstract

Despite efforts to distinguish three different evaluation setups (Bekoulis et al., 2018), numerous end-to-end Relation Extraction (RE) articles present unreliable performance comparisons with previous work. In this paper, we first identify several patterns of invalid comparisons in published papers and describe them to avoid their propagation. We then propose a small empirical study to quantify the impact of the most common mistake and show that it leads to overestimating the final RE performance by around 5% on ACE05. We also seize this opportunity to study the unexplored ablations of two recent developments: the use of language model pretraining (specifically BERT) and span-level NER. This meta-analysis emphasizes the need for rigor in reporting both the evaluation setting and the dataset statistics, and we call for unifying the evaluation setting in end-to-end RE.
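
To make the "most common mistake" concrete, below is a minimal, hypothetical Python sketch (not code from the paper) of how a predicted relation can count as correct under the Boundaries setting but not under the Strict setting of Bekoulis et al. (2018). The example spans, entity types, and helper names are illustrative assumptions.

```python
# Relations are represented as (head_span, head_type, tail_span, tail_type, rel_type).
# Gold annotation: a PER entity at tokens [0, 2) works for an ORG at tokens [5, 7).
gold = {((0, 2), "PER", (5, 7), "ORG", "WORK_FOR")}

# Prediction: correct spans and relation type, but the head entity type is wrong.
pred = ((0, 2), "ORG", (5, 7), "ORG", "WORK_FOR")

def strict_match(p, gold_rels):
    # Strict setting: argument boundaries, entity types, and relation type must all match.
    return p in gold_rels

def boundaries_match(p, gold_rels):
    # Boundaries setting: entity types are ignored; only spans and relation type must match.
    relaxed_gold = {(h, t, r) for (h, _, t, _, r) in gold_rels}
    h, _, t, _, r = p
    return (h, t, r) in relaxed_gold

print("strict:", strict_match(pred, gold))          # False: head entity type differs
print("boundaries:", boundaries_match(pred, gold))  # True: spans and relation type match
```

Because every Strict true positive is also a Boundaries true positive but not vice versa, Boundaries scores are systematically higher, so quoting a Boundaries number against a Strict number inflates the apparent improvement.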
