Paper Title
Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary
Paper Authors
Paper Abstract
A desirable property of a reference-based evaluation metric that measures the content quality of a summary is that it should estimate how much information that summary has in common with a reference. Traditional text-overlap-based metrics such as ROUGE fail to achieve this because they are limited to matching tokens, either lexically or via embeddings. In this work, we propose a metric to evaluate the content quality of a summary using question answering (QA). QA-based methods directly measure a summary's information overlap with a reference, making them fundamentally different from text overlap metrics. We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval. QAEval outperforms current state-of-the-art metrics on most evaluations using benchmark datasets, while being competitive on others due to limitations of state-of-the-art models. Through a careful analysis of each component of QAEval, we identify its performance bottlenecks and estimate that its potential upper-bound performance surpasses all other automatic metrics, approaching that of the gold-standard Pyramid Method.
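To make the core idea concrete, the following is a minimal sketch of a QA-based content-overlap score. It assumes the QA pairs derived from the reference are already given (the actual QAEval metric generates questions and answers with learned models, which is not reproduced here); the function and variable names are illustrative, not from the paper.

```python
# Sketch of a QA-based content-overlap score (assumption: QA pairs
# extracted from the reference are provided; QAEval itself generates
# them with learned question-generation and QA models).

def qa_overlap_score(qa_pairs, answer_fn):
    """Fraction of reference-derived questions the summary answers correctly.

    qa_pairs:  list of (question, gold_answer) pairs from the reference.
    answer_fn: callable(question) -> answer string read off the summary.
    """
    if not qa_pairs:
        return 0.0
    correct = sum(
        1 for question, gold in qa_pairs
        if answer_fn(question).strip().lower() == gold.strip().lower()
    )
    return correct / len(qa_pairs)

# Toy illustration with hand-written (hypothetical) QA pairs.
reference_qa = [
    ("Who won the match?", "Brazil"),
    ("What was the score?", "2-1"),
]
summary_answers = {"Who won the match?": "Brazil"}

score = qa_overlap_score(
    reference_qa,
    lambda q: summary_answers.get(q, ""),
)
print(score)  # 0.5 -- the summary answers one of the two questions
```

Because the score counts answered questions rather than matched tokens, a summary that paraphrases the reference can still score highly, which is the key difference from lexical-overlap metrics like ROUGE.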