Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpus

Paper Title

Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpus

Authors

Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén

Abstract

This article introduces the Wanca 2017 corpus, a collection of texts crawled from the internet, from which the sentences in rare Uralic languages used in the Uralic Language Identification (ULI) 2020 shared task were collected. We describe the ULI dataset and how it was constructed from the Wanca 2017 corpus and from texts in different languages taken from the Leipzig Corpora Collection. We also provide baseline language identification experiments conducted on the ULI 2020 dataset.
