Paper title

Prague Dependency Treebank - Consolidated 1.0

Paper authors

Jan Hajič, Eduard Bejček, Jaroslava Hlaváčová, Marie Mikulová, Milan Straka, Jan Štěpánek, Barbora Štěpánková

Paper abstract

We present a richly annotated and genre-diversified language resource, the Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0), whose purpose is, as has always been the case for the family of Prague Dependency Treebanks, to serve both as training data for various types of NLP tasks and as a resource for linguistically oriented research. PDT-C 1.0 contains four different datasets of Czech, uniformly annotated using the standard PDT scheme (albeit not everything is annotated manually, as we describe in detail here). The texts come from different sources: daily newspaper articles, a Czech translation of the Wall Street Journal, transcribed dialogs, and a small amount of user-generated, short, often non-standard language segments typed into a web translator. Altogether, the treebank contains around 180,000 sentences with their morphological, surface-syntactic, and deep-syntactic annotation. The diversity of the texts and annotations should serve NLP applications well and makes the corpus an invaluable resource for linguistic research, including comparative studies of texts of different genres. The corpus is publicly and freely available.
