Paper title
A High-Quality Multilingual Dataset for Structured Documentation Translation
Paper authors
Paper abstract
This paper presents a high-quality multilingual dataset for the documentation domain to advance research on the localization of structured text. Unlike widely used datasets for plain-text translation, we collect XML-structured parallel text segments from the online documentation of an enterprise software platform. These web pages have been professionally translated from English into 16 languages and are maintained by domain experts, and around 100,000 text segments are available for each language pair. We build and evaluate translation models from English into seven target languages, with several different copy mechanisms and an XML-constrained beam search. We also experiment with a non-English pair to show that our dataset has the potential to explicitly enable $17 \times 16$ translation settings. Our experiments show that learning to translate with the XML tags improves translation accuracy, and that the beam search accurately generates XML structures. We also discuss the trade-offs of using the copy mechanisms by focusing on the translation of numerical words and named entities. We further provide a detailed human analysis of the gaps between the model output and human translations for real-world applications, including suitability for post-editing.
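To make the idea of an XML-constrained beam search more concrete, the following is a minimal sketch of one way a decoder could reject candidate tokens that would break XML well-formedness. It is not the authors' implementation: the function names `update_stack` and `filter_candidates`, the `<eos>` marker, and the one-token-per-tag assumption are all hypothetical choices made for illustration.

```python
import re
from typing import List, Optional

# Hypothetical end-of-sequence marker; a real system would use its own.
EOS = "<eos>"

# Assumption: each XML tag is a single token such as "<ph>" or "</ph>".
TAG_RE = re.compile(r"^<(/?)([A-Za-z][\w.-]*)>$")


def update_stack(stack: List[str], token: str) -> Optional[List[str]]:
    """Return the open-tag stack after emitting `token`,
    or None if the token would make the output ill-formed."""
    if token == EOS:
        # The hypothesis may end only when every tag has been closed.
        return list(stack) if not stack else None
    match = TAG_RE.match(token)
    if match is None:
        # Ordinary text token: the tag stack is unchanged.
        return list(stack)
    is_closing, name = match.group(1) == "/", match.group(2)
    if not is_closing:
        # An opening tag is pushed onto the stack.
        return stack + [name]
    if stack and stack[-1] == name:
        # A closing tag must match the most recently opened tag.
        return stack[:-1]
    return None


def filter_candidates(stack: List[str], candidates: List[str]) -> List[str]:
    """Keep only the candidate tokens that preserve well-formedness."""
    return [tok for tok in candidates if update_stack(stack, tok) is not None]
```

In a beam search, a filter like this would be applied to each hypothesis's candidate expansions at every step, with the stack carried along as part of the hypothesis state. A fuller constraint could also restrict the tags to those that appear in the source segment, which this sketch does not attempt.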