CAVES: A Dataset to facilitate Explainable Classification and Summarization of Concerns towards COVID Vaccines

Paper title

CAVES: A Dataset to facilitate Explainable Classification and Summarization of Concerns towards COVID Vaccines

Authors

Soham Poddar, Azlaan Mustafa Samad, Rajdeep Mukherjee, Niloy Ganguly, Saptarshi Ghosh

Abstract

Convincing people to get vaccinated against COVID-19 is a key societal challenge at present. As a first step towards this goal, many prior works have relied on social media analysis to understand the specific concerns that people have about these vaccines, such as potential side effects, ineffectiveness, and political factors. Though there are datasets that broadly classify social media posts as Anti-Vax or Pro-Vax, there is no dataset (to our knowledge) that labels social media posts according to the specific anti-vaccine concerns they mention. In this paper, we have curated CAVES, the first large-scale dataset containing about 10k COVID-19 anti-vaccine tweets labelled with various specific anti-vaccine concerns in a multi-label setting. This is also the first multi-label classification dataset that provides explanations for each of the labels. The dataset additionally provides class-wise summaries of all the tweets. We also perform preliminary experiments on the dataset and show that it is very challenging for multi-label explainable classification and tweet summarization, as is evident from the moderate scores achieved by some state-of-the-art models. Our dataset and code are available at: https://github.com/sohampoddar26/caves-data
