Paper Title
Learning How to Translate North Korean through South Korean
Paper Authors
Paper Abstract
South and North Korea both use the Korean language. However, Korean NLP research has focused only on South Korean, and existing NLP systems for the Korean language, such as neural machine translation (NMT) models, cannot properly handle North Korean inputs. Training a model on North Korean data is the most straightforward approach to solving this problem, but there is insufficient data to train NMT models. In this study, we create data for North Korean NMT models using a comparable corpus. First, we manually create evaluation data for automatic alignment and machine translation. Then, we investigate automatic alignment methods suitable for North Korean. Finally, we verify that a model trained on North Korean bilingual data without human annotation can significantly boost North Korean translation accuracy compared to existing South Korean models in zero-shot settings.