Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

Paper Title

Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

Authors

Abdul-Mageed, Muhammad; Elmadany, AbdelRahim; Nagoudi, El Moatez Billah; Pabbi, Dinesh; Verma, Kunal; Lin, Rannie

Abstract

We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covering 268 countries), longitudinal (going back as far as 2007), multilingual (spanning 100+ languages), and includes a significant number of location-tagged tweets (~169M). We release the tweet IDs from the dataset. We also develop and release two powerful models: one for identifying whether a tweet is related to the pandemic (best F1 = 97%) and another for detecting misinformation about COVID-19 (best F1 = 92%). A human annotation study demonstrates the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide range of phenomena related to the pandemic. Mega-COV and our models are publicly available.
