Paper Title

A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal

Authors

Demian Gholipour Ghalandari, Chris Hokamp, Nghia The Pham, John Glover, Georgiana Ifrim

Abstract

Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries and has important applications in story clustering for newsfeeds, presentation of search results, and timeline generation. However, there is a lack of datasets that realistically address such use cases at a scale large enough for training supervised models for this task. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters. We build this dataset by leveraging the Wikipedia Current Events Portal (WCEP), which provides concise and neutral human-written summaries of news events, with links to external source articles. We also automatically extend these source articles by looking for related articles in the Common Crawl archive. We provide a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques.
