Paper Title

The Annotation Guideline of LST20 Corpus

Authors

Boonkwan, Prachya; Luantangsrisuk, Vorapon; Phaholphinyo, Sitthaa; Kriengket, Kanyanat; Leenoi, Dhanon; Phrombut, Charun; Boriboon, Monthika; Kosawat, Krit; Supnithi, Thepchai

Abstract

This report presents the annotation guideline for LST20, a large-scale corpus with multiple layers of linguistic annotation for Thai language processing. The guideline covers five layers of linguistic annotation: word segmentation, POS tagging, named entities, clause boundaries, and sentence boundaries. The dataset follows a CoNLL-2003-style format for ease of use. The corpus consists of 3,164,864 words, 288,020 named entities, 248,962 clauses, and 74,180 sentences, annotated with 16 distinct POS tags. All 3,745 documents are also annotated with 15 news genres. Given its size, the dataset is considered large enough for developing joint neural models for NLP. With this publicly available corpus, Thai has for the first time become a language rich in linguistic resources.
