AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

Paper Title

AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

Authors

Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar

Abstract

We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.
