Paper Title
AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages
Paper Authors
Paper Abstract
We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.