Paper Title

PerKey: A Persian News Corpus for Keyphrase Extraction and Generation

Authors

Ehsan Doostmohammadi, Mohammad Hadi Bokaei, Hossein Sameti

Abstract

Keyphrases provide an extremely dense summary of a text. Such information can be used in many Natural Language Processing tasks, such as information retrieval and text summarization. Because previous studies on Persian keyword or keyphrase extraction have not published their data, the field lacks a human-extracted keyphrase dataset. In this paper, we introduce PerKey, a corpus of 553k news articles from six Persian news websites and agencies with relatively high-quality author-extracted keyphrases, which were then filtered and cleaned to improve their quality further. The resulting data was assessed by human evaluators to verify the quality of the keyphrases. We also measured the performance of different supervised and unsupervised techniques on the dataset, e.g. TF-IDF, MultipartiteRank, and KEA, using precision, recall, and F1-score.
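As a rough illustration of the evaluation metrics named above, the sketch below computes precision, recall, and F1-score for one document by exact match between predicted and reference keyphrases. The function name `keyphrase_prf`, the whitespace and case normalization, and the sample phrases are assumptions made for illustration; the abstract does not state the matching protocol the authors used (for example stemming or a top-k cutoff).

```python
def keyphrase_prf(predicted, gold):
    """Exact-match precision, recall and F1 between two keyphrase lists.

    Hypothetical helper: phrases are compared as sets after stripping
    whitespace and lowercasing, which is an assumed normalization.
    """
    pred = {p.strip().lower() for p in predicted if p.strip()}
    ref = {g.strip().lower() for g in gold if g.strip()}
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    hits = len(pred & ref)  # keyphrases present in both sets
    precision = hits / len(pred)
    recall = hits / len(ref)
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example data, not taken from the corpus.
predicted = ["keyphrase extraction", "Persian news", "neural network", "corpus"]
gold = ["keyphrase extraction", "persian news", "dataset"]
precision, recall, f1 = keyphrase_prf(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Corpus-level scores are commonly obtained by averaging these per-document values, though whether the paper averages this way is not stated here.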
