Paper Title

ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization

Authors

Shiyue Zhang, Benjamin Frey, Mohit Bansal

Abstract

Cherokee is a highly endangered Native American language spoken by the Cherokee people. The Cherokee culture is deeply embedded in its language. However, only about 2,000 fluent first-language Cherokee speakers remain in the world, and the number is declining every year. To help save this endangered language, we introduce ChrEn, a Cherokee-English parallel dataset, to facilitate machine translation research between Cherokee and English. Compared to some popular machine translation language pairs, ChrEn is extremely low-resource, containing only 14k sentence pairs in total. We split our parallel data in ways that facilitate both in-domain and out-of-domain evaluation. We also collect 5k Cherokee monolingual data to enable semi-supervised learning. Besides these datasets, we propose several Cherokee-English and English-Cherokee machine translation systems. We compare SMT (phrase-based) versus NMT (RNN-based and Transformer-based) systems; supervised versus semi-supervised (via language model, back-translation, and BERT/Multilingual-BERT) methods; as well as transfer learning versus multilingual joint training with 4 other languages. Our best results are 15.8/12.7 BLEU for in-domain and 6.5/5.0 BLEU for out-of-domain Chr-En/En-Chr translations, respectively, and we hope that our dataset and systems will encourage future work by the community for Cherokee language revitalization. Our data, code, and demo will be publicly available at https://github.com/ZhangShiyue/ChrEn
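
The abstract mentions splitting the parallel data for both in-domain and out-of-domain evaluation but does not say how. The sketch below shows one common way such a split could be built, and it is an assumption rather than the authors' procedure: whole source texts are held out for out-of-domain evaluation, and in-domain evaluation pairs are sampled from the sources that also supply training data. The `Pair` record, the `split_parallel` function, and the source names are hypothetical and do not come from the ChrEn repository.

```python
import random
from collections import namedtuple

# Hypothetical record: one sentence pair plus the text it was drawn from.
Pair = namedtuple("Pair", ["source", "cherokee", "english"])


def split_parallel(pairs, held_out_sources, n_in_domain_eval, seed=0):
    """Split pairs into train, in-domain eval, and out-of-domain eval sets."""
    held_out = set(held_out_sources)
    # Out-of-domain: every pair from a source that is never used for training.
    out_of_domain = [p for p in pairs if p.source in held_out]
    remaining = [p for p in pairs if p.source not in held_out]
    # In-domain: a random sample from the same sources that feed training.
    rng = random.Random(seed)
    rng.shuffle(remaining)
    in_domain = remaining[:n_in_domain_eval]
    train = remaining[n_in_domain_eval:]
    return train, in_domain, out_of_domain


# Toy data (placeholders, not real Cherokee text).
pairs = [Pair("source_a", f"chr_a{i}", f"en_a{i}") for i in range(8)]
pairs += [Pair("source_b", f"chr_b{i}", f"en_b{i}") for i in range(4)]

train, in_dom, out_dom = split_parallel(pairs, {"source_b"}, n_in_domain_eval=2)
print(len(train), len(in_dom), len(out_dom))  # 6 2 4
```

Holding out entire sources, rather than random sentences, is what makes the second evaluation set out-of-domain: the model never sees text of that kind during training, which is consistent with the lower out-of-domain BLEU scores reported above.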
