Entity Recognition and Relation Extraction from Scientific and Technical Texts in Russian

Paper Title

Entity Recognition and Relation Extraction from Scientific and Technical Texts in Russian

Authors

Elena Bruches, Alexey Pauls, Tatiana Batura, Vladimir Isachenko

Abstract

This paper studies methods for information extraction (entity recognition and relation classification) from scientific texts on information technology. Scientific publications provide valuable insight into cutting-edge scientific advances, but efficiently processing the growing volume of data is time-consuming. We propose several modifications of methods for the Russian language and report the results of experiments comparing a keyword extraction method, a vocabulary method, and several methods based on neural networks. Text collections for these tasks exist for English and are actively used by the scientific community, but at present such datasets in Russian are not publicly available. We present RuSERRC, a corpus of scientific texts in Russian. The dataset consists of 1600 unlabeled documents and 80 documents labeled with entities and semantic relations (6 relation types were considered). The dataset and models are available at https://github.com/iis-research-team. We hope they will be useful for research and for the development of information extraction systems.

Paper Title