Paper Title

COVID-19 Twitter Dataset with Latent Topics, Sentiments and Emotions Attributes

Authors

Raj Kumar Gupta, Ajay Vishwanath, Yinping Yang

Abstract

This paper describes a large global dataset on people's discourse and responses to the COVID-19 pandemic over the Twitter platform. From 28 January 2020 to 1 June 2022, we collected and processed over 252 million Twitter posts from more than 29 million unique users using four keywords: "corona", "wuhan", "nCov" and "covid". Leveraging probabilistic topic modelling and pre-trained machine learning-based emotion recognition algorithms, we labelled each tweet with seventeen attributes, including a) ten binary attributes indicating the tweet's relevance (1) or irrelevance (0) to the top ten detected topics, b) five quantitative emotion attributes indicating the degree of intensity of the valence or sentiment (from 0: extremely negative to 1: extremely positive) and the degree of intensity of fear, anger, sadness and happiness emotions (from 0: not at all to 1: extremely intense), and c) two categorical attributes indicating the sentiment (very negative, negative, neutral or mixed, positive, very positive) and the dominant emotion (fear, anger, sadness, happiness, no specific emotion) the tweet is mainly expressing. We discuss the technical validity and report the descriptive statistics of these attributes, their temporal distribution, and geographic representation. The paper concludes with a discussion of the dataset's usage in communication, psychology, public health, economics, and epidemiology.
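The seventeen per-tweet attributes described in the abstract can be pictured as a single record. The following is a minimal sketch, not the dataset's actual schema: the class name, field names, and example values are hypothetical, and only the attribute counts, value ranges, and category labels are taken from the abstract.

```python
from dataclasses import dataclass
from typing import Tuple

# Category labels as listed in the abstract.
SENTIMENTS = ("very negative", "negative", "neutral or mixed",
              "positive", "very positive")
EMOTIONS = ("fear", "anger", "sadness", "happiness", "no specific emotion")


@dataclass(frozen=True)
class LabelledTweet:
    """Hypothetical record holding the seventeen attributes of one tweet."""
    topics: Tuple[int, ...]   # ten binary flags: 1 = relevant, 0 = irrelevant
    valence: float            # 0 = extremely negative, 1 = extremely positive
    fear: float               # 0 = not at all, 1 = extremely intense
    anger: float
    sadness: float
    happiness: float
    sentiment: str            # one of SENTIMENTS
    dominant_emotion: str     # one of EMOTIONS

    def __post_init__(self):
        # Check the ranges and label sets stated in the abstract.
        if len(self.topics) != 10 or any(t not in (0, 1) for t in self.topics):
            raise ValueError("topics must be ten 0/1 flags")
        for name in ("valence", "fear", "anger", "sadness", "happiness"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1]")
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"unknown sentiment: {self.sentiment!r}")
        if self.dominant_emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion: {self.dominant_emotion!r}")

    def attribute_count(self) -> int:
        # 10 topic flags + 5 intensity scores + 2 categorical labels
        return len(self.topics) + 5 + 2


# Illustrative values only; they are not drawn from the dataset.
example = LabelledTweet(
    topics=(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    valence=0.2, fear=0.7, anger=0.3, sadness=0.4, happiness=0.1,
    sentiment="negative", dominant_emotion="fear",
)
print(example.attribute_count())  # 17
```

The sketch only validates ranges and labels. How the authors derive the categorical sentiment and dominant emotion from the intensity scores is not stated in the abstract, so no such mapping is attempted here.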
