Title
On the Evaluation Metrics for Paraphrase Generation
Authors
Abstract
In this paper, we revisit automatic metrics for paraphrase evaluation and obtain two findings that contradict conventional wisdom: (1) reference-free metrics achieve better performance than their reference-based counterparts, and (2) the most commonly used metrics do not align well with human annotation. We explore the underlying reasons behind these findings through additional experiments and in-depth analysis. Based on the experiments and analysis, we propose ParaScore, a new evaluation metric for paraphrase generation. It possesses the merits of both reference-based and reference-free metrics and explicitly models lexical divergence. Experimental results demonstrate that ParaScore significantly outperforms existing metrics.