Paper Title
Understanding Translationese in Cross-Lingual Summarization
Authors
Abstract
Given a document in a source language, cross-lingual summarization (CLS) aims to generate a concise summary in a different target language. Unlike in monolingual summarization (MS), naturally occurring source-language documents paired with target-language summaries are rare. To collect large-scale CLS data, existing datasets typically involve translation in their creation. However, translated text differs from text originally written in that language, a phenomenon known as translationese. In this paper, we first confirm that different approaches to constructing CLS datasets lead to different degrees of translationese. We then systematically investigate how translationese affects CLS model evaluation and performance when it appears in source documents or target summaries. In detail, we find that (1) translationese in the documents or summaries of test sets might lead to a discrepancy between human judgment and automatic evaluation; (2) translationese in training sets would harm model performance in real-world applications; (3) although machine-translated documents involve translationese, they are very useful for building CLS systems for low-resource languages under specific training strategies. Lastly, we give suggestions for future CLS research, including dataset and model development. We hope that our work draws researchers' attention to the phenomenon of translationese in CLS and encourages them to take it into account in the future.