Paper Title
Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary
Paper Authors
Paper Abstract
A desirable property of a reference-based evaluation metric that measures the content quality of a summary is that it should estimate how much information that summary has in common with a reference. Traditional text-overlap-based metrics such as ROUGE fail to achieve this because they are limited to matching tokens, either lexically or via embeddings. In this work, we propose a metric to evaluate the content quality of a summary using question answering (QA). QA-based methods directly measure a summary's information overlap with a reference, making them fundamentally different from text overlap metrics. We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval. QAEval outperforms current state-of-the-art metrics on most evaluations using benchmark datasets, while being competitive on others due to limitations of state-of-the-art models. Through a careful analysis of each component of QAEval, we identify its performance bottlenecks and estimate that its potential upper-bound performance surpasses all other automatic metrics, approaching that of the gold-standard Pyramid Method.
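To make the core idea concrete, the following is a minimal sketch of a QA-based content-overlap score. It assumes the QA pairs derived from the reference are already given (the actual QAEval metric generates questions and answers with learned models, which is not reproduced here); the function and variable names are illustrative, not from the paper.

```python
# Sketch of a QA-based content-overlap score (assumption: QA pairs
# extracted from the reference are provided; QAEval itself generates
# them with learned question-generation and QA models).

def qa_overlap_score(qa_pairs, answer_fn):
    """Fraction of reference-derived questions the summary answers correctly.

    qa_pairs:  list of (question, gold_answer) pairs from the reference.
    answer_fn: callable(question) -> answer string read off the summary.
    """
    if not qa_pairs:
        return 0.0
    correct = sum(
        1 for question, gold in qa_pairs
        if answer_fn(question).strip().lower() == gold.strip().lower()
    )
    return correct / len(qa_pairs)

# Toy illustration with hand-written (hypothetical) QA pairs.
reference_qa = [
    ("Who won the match?", "Brazil"),
    ("What was the score?", "2-1"),
]
summary_answers = {"Who won the match?": "Brazil"}

score = qa_overlap_score(
    reference_qa,
    lambda q: summary_answers.get(q, ""),
)
print(score)  # 0.5 -- the summary answers one of the two questions
```

Because the score counts answered questions rather than matched tokens, a summary that paraphrases the reference can still score highly, which is the key difference from lexical-overlap metrics like ROUGE.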