Reduce Indonesian Vocabularies with an Indonesian Sub-word Separator

Paper title

Reduce Indonesian Vocabularies with an Indonesian Sub-word Separator

Authors

Mukhlis Amien, Chong Feng, Heyan Huang

Abstract

Indonesian is an agglutinative language: it forms words through a compounding process. A translation model for this language therefore needs a mechanism that operates below the word level, at the sub-word level. Because compounding makes the vocabulary size explode, it leads to a rare word problem. We propose a strategy to address the rare word problem in neural machine translation (NMT) systems in which Indonesian is one language of the pair. Our approach uses a rule-based method to split a word into its root and the accompanying affixes, so that its meaning and context are retained. A rule-based algorithm has a further advantage: it requires no corpus data and only applies the standard Indonesian rules. Our experiments confirm that this method is practical. It significantly reduces the vocabulary size (up to 57%), and on English-to-Indonesian translation it yields an improvement of up to 5 BLEU points over a similar NMT system that does not use this technique.
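The idea of splitting a word into a root plus marked affixes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the tiny affix lists, the `MIN_ROOT` length guard, the `@@` marker convention, and the function name `split_word` are all hypothetical. The actual method applies the standard Indonesian morphological rules, which are far more complete than this toy.

```python
# Toy rule-based sub-word splitter (illustrative only, not the authors' algorithm).
# The affix lists below are a small hypothetical subset of Indonesian affixes.
PREFIXES = ("ber", "ter", "di")
SUFFIXES = ("nya", "kan", "lah", "an", "i")
MIN_ROOT = 4  # hypothetical guard: never leave a root shorter than this


def split_word(word):
    """Split a word into [prefix@@, root, @@suffix], stripping at most one of each."""
    before, after = [], []
    root = word.lower()
    for prefix in PREFIXES:
        if root.startswith(prefix) and len(root) - len(prefix) >= MIN_ROOT:
            before.append(prefix + "@@")  # trailing marker: attaches to what follows
            root = root[len(prefix):]
            break
    for suffix in SUFFIXES:
        if root.endswith(suffix) and len(root) - len(suffix) >= MIN_ROOT:
            after.append("@@" + suffix)  # leading marker: attaches to what precedes
            root = root[:-len(suffix)]
            break
    return before + [root] + after


def vocabulary(words):
    """Collect the set of distinct tokens produced by splitting every word."""
    return {token for word in words for token in split_word(word)}


if __name__ == "__main__":
    words = ["baca", "dibaca", "bacakan", "dibacakan"]
    for w in words:
        print(w, "->", split_word(w))
    print(len(set(words)), "word types vs", len(vocabulary(words)), "sub-word types")
```

In this toy example four surface forms of one root share three sub-word tokens, which is the mechanism by which affix separation shrinks a vocabulary. Without a root dictionary the sketch will over-split some unaffixed words that merely begin or end like an affix, so it should not be read as a usable stemmer.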
