Paper Title
CsFEVER and CTKFacts: Acquiring Czech data for fact verification
Paper Authors
Paper Abstract
In this paper, we examine several methods of acquiring Czech data for automated fact-checking, a task commonly modeled as classification of textual claim veracity w.r.t. a corpus of trusted ground truths. We attempt to collect sets of data in the form of a factual claim, evidence within the ground-truth corpus, and its veracity label (supported, refuted, or not enough info). As a first attempt, we generate a Czech version of the large-scale FEVER dataset built on top of the Wikipedia corpus. We take a hybrid approach of machine translation and document alignment; the approach and the tools we provide can be easily applied to other languages. We discuss its weaknesses and inaccuracies, propose a future approach for their cleaning, and publish the 127k resulting translations, as well as a version of the dataset reliably applicable to the Natural Language Inference task, CsFEVER-NLI. Furthermore, we collect a novel dataset of 3,097 claims, which is annotated against a corpus of 2.2M Czech News Agency articles. We present its extended annotation methodology based on the FEVER approach, and, as the underlying corpus is kept a trade secret, we also publish a standalone version of the dataset for the task of Natural Language Inference, which we call CTKFactsNLI. We analyze both acquired datasets for spurious cues, i.e., annotation patterns leading to model overfitting. CTKFacts is further examined for inter-annotator agreement, thoroughly cleaned, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline and publish the NLI datasets, as well as our annotation platform and other experimental data.
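To make the NLI framing in the abstract concrete, the sketch below shows how a claim and a piece of retrieved evidence map onto a three-way veracity classification. This is a minimal illustration, not the paper's baseline: the checkpoint name and the label order are assumptions, and a model actually fine-tuned on CsFEVER-NLI or CTKFactsNLI would be needed for meaningful predictions.

```python
# Minimal sketch of the claim-veracity NLI task: classify a
# (evidence, claim) pair into SUPPORTS / REFUTES / NOT ENOUGH INFO.
# The checkpoint and label order below are illustrative assumptions;
# xlm-roberta-base here carries a randomly initialized classification
# head, so real use requires fine-tuning on CsFEVER-NLI or CTKFactsNLI.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]  # assumed ordering
MODEL_NAME = "xlm-roberta-base"  # any multilingual encoder would fit

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)

# A Czech claim ("Prague is the capital of the Czech Republic.")
# paired with evidence text from the ground-truth corpus.
claim = "Praha je hlavní město České republiky."
evidence = "Praha je hlavní a současně největší město Česka."

# Evidence and claim are encoded as a sentence pair, as is standard for NLI.
inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[int(logits.argmax(dim=-1))])
```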