Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

Paper title

Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

Authors

Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, Keith Hall

Abstract

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text.

Keywords: romanization, transliteration, South Asian languages
