SciTweets -- A Dataset and Annotation Framework for Detecting Scientific Online Discourse

Paper Title

SciTweets -- A Dataset and Annotation Framework for Detecting Scientific Online Discourse

Authors

Salim Hafid, Sebastian Schellhammer, Sandra Bringay, Konstantin Todorov, Stefan Dietze

Abstract

Scientific topics, claims and resources are increasingly debated as part of online discourse, where prominent examples include discourse related to COVID-19 or climate change. This has led to both significant societal impact and increased interest in scientific online discourse from various disciplines. For instance, communication studies aim at a deeper understanding of the biases, quality or spreading patterns of scientific information, whereas computational methods have been proposed to extract, classify or verify scientific claims using NLP and IR techniques. However, research across disciplines currently lacks both robust definitions of the various forms of science-relatedness and appropriate ground-truth data for distinguishing them. In this work, we contribute (a) an annotation framework and corresponding definitions for different forms of science-relatedness of online discourse in tweets, (b) an expert-annotated dataset of 1261 tweets obtained through our labeling framework, reaching an average Fleiss' kappa $\kappa$ of 0.63, and (c) a multi-label classifier trained on our data that detects science-relatedness with 89% F1 and can also detect distinct forms of scientific knowledge (claims, references). With this work we aim to lay the foundation for developing and evaluating robust methods for analysing science as part of large-scale online discourse.
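
The abstract reports inter-annotator agreement as Fleiss' kappa. For readers unfamiliar with the measure, the following is a minimal sketch of the standard Fleiss' kappa formula for a single labeling category. It is not the authors' code: the function name `fleiss_kappa`, the toy rating matrix, and the choice of three raters are all assumptions made for illustration, and the sketch does not reproduce how the reported average of 0.63 was obtained.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a rating-count matrix.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must have been rated by the same number of raters (>= 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical toy data: 4 tweets, 3 raters, columns are
    # (science-related, not science-related). Not taken from the dataset.
    toy_counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
    print(round(fleiss_kappa(toy_counts), 3))
```

Values near 1 indicate agreement well above chance and values near 0 indicate agreement no better than chance. An average over several categories, as the abstract describes, would presumably apply a computation of this kind per category and then take the mean, though the abstract does not spell this out.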
