Paper Title

Using Multiple Subwords to Improve English-Esperanto Automated Literary Translation Quality

Authors

Alberto Poncelas, Jan Buts, James Hadley, Andy Way

Abstract

Building Machine Translation (MT) systems for low-resource languages remains challenging. For many language pairs, parallel data are not widely available, and in such cases MT models do not achieve results comparable to those seen with high-resource languages. When data are scarce, it is of paramount importance to make optimal use of the limited material available. To that end, in this paper we propose employing the same parallel sentences multiple times, only changing the way the words are split each time. For this purpose we use several Byte Pair Encoding models, with various merge operations used in their configuration. In our experiments, we use this technique to expand the available data and improve an MT system involving a low-resource language pair, namely English-Esperanto. As an additional contribution, we made available a set of English-Esperanto parallel data in the literary domain.
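The idea in the abstract, reusing the same parallel sentences under several subword segmentations, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy BPE learner, the function names (`learn_bpe`, `merge_pair`, `apply_bpe`, `augment`), the `</w>` end-of-word marker, the choice to learn separate source and target models and segment both sides, the merge counts, and the two toy sentence pairs. The authors' actual tooling and settings may differ.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of sentences (toy BPE)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter()
    for sentence in corpus:
        for word in sentence.split():
            vocab[tuple(word) + ("</w>",)] += 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself so runs are deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(sentence, merges):
    """Segment a sentence by applying the learned merges in order."""
    tokens = []
    for word in sentence.split():
        symbols = tuple(word) + ("</w>",)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return " ".join(tokens)


def augment(src_sentences, tgt_sentences, merge_settings):
    """Return one copy of the parallel data per merge setting, each segmented differently."""
    augmented = []
    for n in merge_settings:
        # Assumption: separate BPE models per language, both sides segmented.
        src_merges = learn_bpe(src_sentences, n)
        tgt_merges = learn_bpe(tgt_sentences, n)
        for s, t in zip(src_sentences, tgt_sentences):
            augmented.append((apply_bpe(s, src_merges), apply_bpe(t, tgt_merges)))
    return augmented


# Toy parallel data (hypothetical, not from the released corpus).
english = ["the dog runs", "the cat sleeps"]
esperanto = ["la hundo kuras", "la kato dormas"]

# Hypothetical merge counts: 0 keeps characters, larger values yield longer subwords.
expanded = augment(english, esperanto, merge_settings=[0, 3, 20])
for src, tgt in expanded:
    print(src, "|||", tgt)
```

Each merge setting contributes one full copy of the corpus, so the training set grows linearly with the number of BPE models while the underlying sentences stay the same. In practice a maintained BPE implementation would replace the toy learner above.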
