Paper Title


Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages

Paper Authors

Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreya Goyal, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar

Paper Abstract


Building Natural Language Understanding (NLU) capabilities for Indic languages, which have a collective speaker base of more than one billion speakers, is absolutely crucial. In this work, we aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes: (i) monolingual corpora, (ii) NLU test sets, and (iii) multilingual LLMs focusing on Indic languages. Specifically, we curate the largest monolingual corpora, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families - a 2.3x increase over prior work, while supporting 12 additional languages. Next, we create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages. Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature. To the best of our knowledge, this is the first effort towards creating a standard benchmark for Indic languages that aims to test the multilingual zero-shot capabilities of pretrained language models. Finally, we train IndicBERT v2, a state-of-the-art model supporting all the languages. Averaged across languages and tasks, the model achieves an absolute improvement of 2 points over a strong baseline. The data and models are available at https://github.com/AI4Bharat/IndicBERT.
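For readers who want to try the released model, the minimal sketch below shows one possible way to load IndicBERT v2 with the Hugging Face Transformers library and run a masked-language-model forward pass. The checkpoint name used here is an assumption for illustration only; the actual released identifiers should be checked in the repository linked above.

```python
# Minimal sketch: loading IndicBERT v2 via Hugging Face Transformers.
# NOTE: the checkpoint name below is an assumption; consult
# https://github.com/AI4Bharat/IndicBERT for the released identifiers.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "ai4bharat/IndicBERTv2-MLM-only"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Encode a Hindi sentence and run a forward pass.
inputs = tokenizer("भारत एक विशाल देश है।", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)
```

The same checkpoint can then be fine-tuned on the IndicXTREME tasks in the usual Transformers workflow; the paper's zero-shot evaluation setting corresponds to fine-tuning on English data only and evaluating on the Indic test sets.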
