Paper title
Korean-Specific Dataset for Table Question Answering
Paper authors
Paper abstract
Existing question answering systems mainly focus on text data. However, much of the data produced daily is stored as tables, which can be found in documents, in relational databases, or on the web. Many datasets for table question answering exist in English, but few exist in Korean. In this paper, we describe how we construct Korean-specific datasets for table question answering. The Korean tabular dataset is a collection of 1.4M tables with corresponding descriptions, intended for unsupervised pre-training of language models. The Korean table question answering corpus consists of 70k question-answer pairs created by crowd-sourced workers. We then build a pre-trained language model based on Transformer and fine-tune it for table question answering with these datasets, and we report the evaluation results of our model. We make our datasets publicly available via our GitHub repository and hope that they will support further studies on question answering over tables and on the transformation of table formats.