QBSUM: a Large-Scale Query-Based Document Summarization Dataset from Real-world Applications

Paper title

QBSUM: a Large-Scale Query-Based Document Summarization Dataset from Real-world Applications

Authors

Mingjun Zhao, Shengli Yan, Bang Liu, Xinwang Zhong, Qian Hao, Haolan Chen, Di Niu, Bowei Long, Weidong Guo

Abstract

Query-based document summarization aims to extract or generate a summary of a document that directly answers or is relevant to the search query. It is an important technique that can benefit a variety of applications such as search engines, document-level machine reading comprehension, and chatbots. Currently, datasets designed for query-based summarization are few in number, and existing datasets are limited in both scale and quality. Moreover, to the best of our knowledge, there is no publicly available dataset for Chinese query-based document summarization. In this paper, we present QBSUM, a high-quality large-scale dataset consisting of 49,000+ data samples for the task of Chinese query-based document summarization. We also propose multiple unsupervised and supervised solutions to the task and demonstrate their high-speed inference and superior performance via both offline experiments and online A/B tests. The QBSUM dataset is released in order to facilitate future advancement of this research field.
