Paper Title

AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

Authors

Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N. C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar

Abstract

We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.
