Paper title
Active Keyword Selection to Track Evolving Topics on Twitter
Paper authors
Paper abstract
How can we study social interactions on evolving topics at massive scale? Over the past decade, researchers from diverse fields such as economics, political science, and public health have often done this by querying Twitter's public API endpoints with hand-picked topical keywords to search or stream discussions. However, despite the API's accessibility, it remains difficult to select and update keywords to collect high-quality data relevant to topics of interest. In this paper, we propose an active learning method for rapidly refining query keywords to increase both the topical relevance and the size of the resulting dataset. We leverage a large open-source COVID-19 Twitter dataset to illustrate the applicability of our method in tracking Tweets around the key sub-topics of Vaccine, Mask, and Lockdown. Our experiments show that our method achieves an average topic-related keyword recall 2x higher than baselines. We open-source our code along with a web interface for keyword selection to make data collection from Twitter more systematic for researchers.
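To make the idea of iteratively refining query keywords concrete, here is a minimal toy sketch of a human-in-the-loop refinement loop, together with a simple keyword-recall measure. It is not the paper's algorithm: the co-occurrence scoring, the function names `refine_keywords` and `keyword_recall`, the `oracle` callback (standing in for a human annotator), and the definition of recall as the fraction of known topic-relevant keywords recovered are all assumptions made for illustration.

```python
from collections import Counter


def refine_keywords(tweets, seeds, oracle, rounds=3, per_round=2):
    """Toy keyword-refinement loop (hypothetical; not the paper's method)."""
    selected = set(seeds)
    rejected = set()
    for _ in range(rounds):
        # Count words that co-occur with any currently selected keyword.
        counts = Counter()
        for tweet in tweets:
            words = set(tweet.lower().split())
            if words & selected:
                counts.update(words - selected - rejected)
        if not counts:
            break  # no unseen candidates left to ask about
        # Ask the annotator (oracle) only about the most frequent candidates.
        for word, _ in counts.most_common(per_round):
            (selected if oracle(word) else rejected).add(word)
    return selected


def keyword_recall(selected, relevant):
    """Fraction of known topic-relevant keywords that were selected."""
    return len(selected & relevant) / len(relevant)


# Tiny illustrative corpus (made up for this sketch).
tweets = ["vaccine dose", "dose booster", "lockdown rules"]
relevant = {"vaccine", "dose", "booster"}
found = refine_keywords(tweets, {"vaccine"}, oracle=lambda w: w in relevant)
recall = keyword_recall(found, relevant)
```

In this sketch the annotator is consulted only for a few candidates per round, which is the general motivation for active selection: labeling effort is spent on the keywords most likely to expand the query. A real system would need a better candidate score and proper tokenization.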