Title
Rethinking and Refining the Distinct Metric
Authors
Abstract
Distinct-$n$ score~\cite{Li2016} is a widely used automatic metric for evaluating diversity in language generation tasks. However, we observe that the original method for calculating distinct scores has evident biases that tend to assign higher penalties to longer sequences. We refine the calculation of distinct scores by scaling the number of distinct tokens based on its expectation. We provide both empirical and theoretical evidence to show that our method effectively removes the biases present in the original distinct score. Our experiments show that our proposed metric, \textit{Expectation-Adjusted Distinct (EAD)}, correlates better with human judgment in evaluating response diversity. To foster future research, we provide an example implementation at \url{https://github.com/lsy641/Expectation-Adjusted-Distinct}.
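The adjustment described in the abstract can be illustrated with a short sketch. The idea is to divide the observed number of distinct $n$-grams by the number expected when $n$-grams are drawn uniformly from a vocabulary of size $V$: with $C$ total $n$-grams, the expected number of distinct ones is $V\bigl(1-((V-1)/V)^{C}\bigr)$. The function below is a simplified illustration under that uniformity assumption, not the authors' reference implementation (see the linked repository for that); the names `ead` and `vocab_size` are ours.

```python
def ead(tokens, vocab_size, n=1):
    """Expectation-Adjusted Distinct sketch: distinct n-gram count
    normalized by its expectation under a uniform n-gram assumption."""
    # Slide a window of length n over the token sequence to collect n-grams.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(ngrams)
    distinct = len(set(ngrams))
    v = vocab_size
    # Expected number of distinct n-grams among `total` uniform draws
    # from a pool of size v.
    expected = v * (1.0 - ((v - 1.0) / v) ** total)
    return distinct / expected

# A fully diverse short sequence scores near 1, while a fully
# repetitive one scores much lower, regardless of length.
diverse = ead(list(range(10)), vocab_size=1000)
repetitive = ead([0] * 10, vocab_size=1000)
```

Unlike the original distinct-$n$ (distinct count divided by total count), this normalization does not shrink automatically as sequences get longer, which is the length bias the paper targets.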