Paper Title

Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

Authors

Muhammad Abdul-Mageed, AbdelRahim Elmadany, El Moatez Billah Nagoudi, Dinesh Pabbi, Kunal Verma, Rannie Lin

Abstract

We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covering 268 countries), longitudinal (going back as far as 2007), multilingual (spanning 100+ languages), and contains a significant number of location-tagged tweets (~169M). We release tweet IDs from the dataset. We also develop and release two powerful models: one for identifying whether a tweet is related to the pandemic (best F1 = 97%) and another for detecting misinformation about COVID-19 (best F1 = 92%). A human annotation study reveals the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide range of phenomena related to the pandemic. Mega-COV and our models are publicly available.
