Paper Title

ReSCo-CC: Unsupervised Identification of Key Disinformation Sentences

Authors

Soumya Suvra Ghosal, Deepak P, Anna Jurek-Loughrey

Abstract

Disinformation is often presented in long textual articles, especially when it relates to domains such as health, often seen in relation to COVID-19. These articles are typically observed to have a number of trustworthy sentences among which core disinformation sentences are scattered. In this paper, we propose a novel unsupervised task of identifying sentences containing key disinformation within a document that is known to be untrustworthy. We design a three-phase statistical NLP solution for the task which starts with embedding sentences within a bespoke feature space designed for the task. Sentences represented using those features are then clustered, following which the key sentences are identified through proximity scoring. We also curate a new dataset with sentence-level disinformation scorings to aid evaluation for this task; the dataset is being made publicly available to facilitate further research. Based on a comprehensive empirical evaluation against techniques from related tasks such as claim detection and summarization, as well as against simplified variants of our proposed approach, we illustrate that our method is able to identify core disinformation effectively.
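The three phases named in the abstract (embed, cluster, score by proximity) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the bespoke feature space, the clustering algorithm, or what proximity is measured against, so the sketch assumes normalised bag-of-words vectors, a small hand-written k-means, and Euclidean distance to each sentence's own cluster centroid. All function names are hypothetical.

```python
import math
import random
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(sentences):
    """Phase 1 (assumed stand-in): L2-normalised bag-of-words vectors.

    The paper uses a bespoke feature space that the abstract does not describe.
    """
    tokenised = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    vocab = sorted({w for tokens in tokenised for w in tokens})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for tokens in tokenised:
        vec = [0.0] * len(vocab)
        for word, count in Counter(tokens).items():
            vec[index[word]] = float(count)
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(vectors, centroids):
    """Label each vector with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
            for v in vectors]


def kmeans(vectors, k, iters=20, seed=0):
    """Phase 2 (assumed stand-in): plain k-means with seeded random init."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        labels = assign(vectors, centroids)
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(vectors, centroids), centroids


def rank_sentences(text, k=2):
    """Phase 3 (assumed): score each sentence by distance to its own centroid.

    Returns (distance, sentence) pairs, closest to a centroid first.
    """
    sentences = split_sentences(text)
    vectors = embed(sentences)
    k = min(k, len(sentences))
    labels, centroids = kmeans(vectors, k)
    scored = [(dist(v, centroids[lab]), s)
              for v, lab, s in zip(vectors, labels, sentences)]
    return sorted(scored)
```

Calling `rank_sentences(article_text, k=2)` returns every sentence paired with its distance score, lowest first. The sketch only shows how the phases connect. It does not decide which cluster holds the disinformation, and any ranking it gives depends entirely on the assumed features.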
