Paper Title
Investigating Data Variance in Evaluations of Automatic Machine Translation Metrics
Paper Authors
Paper Abstract
Current practices in metric evaluation focus on a single dataset, e.g., the Newstest dataset in each year's WMT Metrics Shared Task. However, in this paper, we show qualitatively and quantitatively that the performance of metrics is sensitive to data: the ranking of metrics varies when the evaluation is conducted on different datasets. This paper then investigates two potential hypotheses, i.e., insignificant data points and deviation from the Independent and Identically Distributed (i.i.d.) assumption, which may be responsible for the issue of data variance. In conclusion, our findings suggest that when evaluating automatic translation metrics, researchers should take data variance into account and be cautious about claiming results on a single dataset, because those results may be inconsistent with results on most other datasets.
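As a hypothetical illustration of the data-variance issue described above, the sketch below computes Pearson correlation between invented human judgments and the scores of two invented metrics on two toy datasets; all numbers and names (`metric1`, `metric2`, datasets A/B) are made up for illustration and are not from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy human judgments and two hypothetical metrics' scores on two
# different evaluation datasets (all numbers are illustrative).
human_a   = [1.0, 2.0, 3.0, 4.0]
metric1_a = [1.1, 2.0, 3.2, 3.9]   # tracks humans closely on dataset A
metric2_a = [1.0, 2.5, 4.0, 3.0]   # swaps the top two systems on A

human_b   = [1.0, 2.0, 3.0, 4.0]
metric1_b = [2.0, 1.0, 3.5, 4.0]   # swaps the bottom two systems on B
metric2_b = [1.2, 1.9, 3.1, 4.0]   # tracks humans closely on dataset B

# Which metric correlates better with humans on each dataset?
rank_a = pearson(human_a, metric1_a) > pearson(human_a, metric2_a)
rank_b = pearson(human_b, metric1_b) > pearson(human_b, metric2_b)
print(rank_a, rank_b)  # metric1 wins on A but loses on B
```

The point of the sketch is only that a single-dataset comparison can anoint a "best" metric that another dataset would not, which is the paper's motivation for evaluating across multiple datasets.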