Paper Title


Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages

Paper Authors

Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreya Goyal, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar

Paper Abstract


Building Natural Language Understanding (NLU) capabilities for Indic languages, which have a collective speaker base of more than one billion speakers, is absolutely crucial. In this work, we aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes: (i) monolingual corpora, (ii) NLU test sets, and (iii) multilingual LLMs focusing on Indic languages. Specifically, we curate the largest monolingual corpora, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families - a 2.3x increase over prior work, while supporting 12 additional languages. Next, we create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages. Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature. To the best of our knowledge, this is the first effort towards creating a standard benchmark for Indic languages that aims to test the multilingual zero-shot capabilities of pretrained language models. Finally, we train IndicBERT v2, a state-of-the-art model supporting all the languages. Averaged across languages and tasks, the model achieves an absolute improvement of 2 points over a strong baseline. The data and models are available at https://github.com/AI4Bharat/IndicBERT.
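For readers who want to try the released model, the minimal sketch below shows one possible way to load IndicBERT v2 with the Hugging Face Transformers library and run a masked-language-model forward pass. The checkpoint name used here is an assumption for illustration only; the actual released identifiers should be checked in the repository linked above.

```python
# Minimal sketch: loading IndicBERT v2 via Hugging Face Transformers.
# NOTE: the checkpoint name below is an assumption; consult
# https://github.com/AI4Bharat/IndicBERT for the released identifiers.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "ai4bharat/IndicBERTv2-MLM-only"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Encode a Hindi sentence and run a forward pass.
inputs = tokenizer("भारत एक विशाल देश है।", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)
```

The same checkpoint can then be fine-tuned on the IndicXTREME tasks in the usual Transformers workflow; the paper's zero-shot evaluation setting corresponds to fine-tuning on English data only and evaluating on the Indic test sets.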
