Paper Title

ReCO: A Large Scale Chinese Reading Comprehension Dataset on Opinion

Authors

Bingning Wang, Ting Yao, Qi Zhang, Jingfang Xu, Xiaochuan Wang

Abstract

This paper presents ReCO, a human-curated Chinese Reading Comprehension dataset on Opinion. The questions in ReCO are opinion-based queries issued to a commercial search engine. The passages are provided by crowdworkers who extract the supporting snippet from the retrieved documents. Finally, an abstractive yes/no/uncertain answer is given by the crowdworkers. The release of ReCO consists of 300k questions, which, to our knowledge, makes it the largest Chinese reading comprehension dataset. A prominent characteristic of ReCO is that, in addition to the original context paragraph, we also provide the supporting evidence that can be directly used to answer the question. Quality analysis demonstrates the challenge of ReCO: it requires various types of reasoning skills, such as causal inference and logical reasoning. Current QA models that perform very well on many question answering problems, such as BERT, achieve only 77% accuracy on this dataset, a large margin behind the nearly 92% human performance, indicating that ReCO presents a good challenge for machine reading comprehension. The code and dataset are freely available at https://github.com/benywon/ReCO.
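The abstract describes each ReCO instance as a question, a supporting passage, and a three-way yes/no/uncertain answer, with models scored by accuracy. The following is a minimal sketch of how such instances and the accuracy metric could be represented; the field names and toy examples are illustrative assumptions, not the released dataset's actual schema.

```python
# Hypothetical representation of ReCO-style instances and the
# accuracy metric reported in the abstract. The field names
# ("question", "evidence", "answer") are assumptions for
# illustration; consult the released dataset for the real schema.

LABELS = ("yes", "no", "uncertain")

def accuracy(predictions, references):
    """Fraction of examples whose predicted label matches the reference."""
    assert len(predictions) == len(references) and references
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

# Three toy instances, one per answer class (invented examples).
examples = [
    {"question": "Is moderate coffee intake healthy?",
     "evidence": "Several studies link moderate intake to benefits.",
     "answer": "yes"},
    {"question": "Is the earth flat?",
     "evidence": "The earth is an oblate spheroid.",
     "answer": "no"},
    {"question": "Will it rain tomorrow?",
     "evidence": "Forecasts for tomorrow disagree.",
     "answer": "uncertain"},
]

gold = [ex["answer"] for ex in examples]
preds = ["yes", "no", "no"]   # e.g., a model's outputs; 2 of 3 correct
print(accuracy(preds, gold))
```

A BERT-style model, as evaluated in the paper, would map each (question, evidence) pair to one of these three labels; the scoring itself reduces to the simple label-matching accuracy above.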
