Original or Translated? On the Use of Parallel Data for Translation Quality Estimation

Paper Title

Original or Translated? On the Use of Parallel Data for Translation Quality Estimation

Authors

Baopu Qiu, Liang Ding, Di Wu, Lin Shang, Yibing Zhan, Dacheng Tao

Abstract

Machine Translation Quality Estimation (QE) is the task of evaluating translation output in the absence of human-written references. Because human-labeled QE data is scarce, previous work has used abundant unlabeled parallel corpora to produce additional training data with pseudo labels. In this paper, we demonstrate a significant gap between parallel data and real QE data: in QE data, the source side is strictly guaranteed to be original text and the target side to be translated text (namely translationese). Parallel data carries no such guarantee, and translationese may occur on either the source or the target side. We compare the impact of parallel data with different translation directions in QE data augmentation and find that using the source-original part of a parallel corpus consistently outperforms its target-original counterpart. Moreover, since the WMT corpus lacks direction information for each parallel sentence, we train a classifier to distinguish source-original from target-original bitext and analyze how the two differ in style and domain. Together, these findings suggest using source-original parallel data for QE data augmentation, which brings relative improvements of up to 4.0% and 6.4% over undifferentiated data on sentence-level and word-level QE tasks, respectively.
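The selection step described in the abstract, keeping only the source-original part of a parallel corpus before using it for QE data augmentation, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `SentencePair`, `select_source_original`, the 0.5 threshold, and the toy scores are hypothetical, and the stand-in scoring function only mimics the output of a trained direction classifier. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class SentencePair:
    source: str
    target: str


def select_source_original(
    pairs: Iterable[SentencePair],
    prob_source_original: Callable[[SentencePair], float],
    threshold: float = 0.5,
) -> List[SentencePair]:
    """Keep pairs whose estimated probability of being source-original
    is at least `threshold` (an assumed cutoff, not taken from the paper)."""
    return [p for p in pairs if prob_source_original(p) >= threshold]


# Hypothetical scores standing in for a trained direction classifier.
TOY_SCORES = {
    "Das Wetter ist heute schön.": 0.9,
    "Die Sitzung ist eröffnet.": 0.2,
}


def toy_classifier(pair: SentencePair) -> float:
    # Sentences without a score get 0.0 and are therefore dropped.
    return TOY_SCORES.get(pair.source, 0.0)


corpus = [
    SentencePair("Das Wetter ist heute schön.", "The weather is nice today."),
    SentencePair("Die Sitzung ist eröffnet.", "The sitting is open."),
]

# Only pairs scored as source-original remain for pseudo-labeling.
kept = select_source_original(corpus, toy_classifier)
```

In this sketch the filtered list `kept` would be what gets passed on to pseudo-label generation; how the classifier is trained and how labels are produced is left open, since the abstract does not specify either.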
