Paper Title

Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpus

Authors

Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén

Abstract

This article introduces the Wanca 2017 corpus, a collection of texts crawled from the internet, from which sentences in rare Uralic languages were collected for the Uralic Language Identification (ULI) 2020 shared task. We describe the ULI dataset and how it was constructed from the Wanca 2017 corpus and from texts in different languages in the Leipzig Corpora Collection. We also present baseline language identification experiments conducted on the ULI 2020 dataset.
