Paper Title
Application of Data Science to Discover Violence-Related Issues in Iraq
Paper Authors
Paper Abstract
Data science has been used successfully to discover social issues in several parts of the world. However, there is a lack of governmental open data to discover those issues in countries such as Iraq. This situation raises the following questions: how can data science principles be applied to discover social issues despite the lack of open data in Iraq? How can the available data be used to make predictions in places without data? Our contribution is the application of data science to open non-governmental big data from the Global Database of Events, Language, and Tone (GDELT) to discover particular violence-related social issues in Iraq. Specifically, we applied the K-Nearest Neighbors, Naïve Bayes, Decision Trees, and Logistic Regression classification algorithms to discover the following issues: refugees, humanitarian aid, violent protests, fights with artillery and tanks, and mass killings. The best results were obtained with the Decision Trees algorithm when discovering areas with refugee crises and artillery fights. The accuracy for these two events is 0.7629. For the locations of refugee crises, the precision is 0.76, the recall is 0.76, and the F1-score is 0.76. For the locations of artillery fights, the precision is 0.74, the recall is 0.75, and the F1-score is 0.75.
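To make the reported metrics concrete, the following minimal sketch shows how accuracy, precision, recall, and F1-score are computed for one class in a two-class problem such as refugee crises versus artillery fights. It is an illustration only, not the authors' code: the function name `binary_metrics` and the toy labels are assumptions invented for this example and are not drawn from GDELT or from the paper's results.

```python
from collections import Counter


def binary_metrics(y_true, y_pred, positive):
    """Return accuracy, precision, recall and F1 with `positive` as the target class."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["tp"] += 1  # correctly predicted target class
        elif pred == positive:
            counts["fp"] += 1  # predicted target class, but it was the other class
        elif truth == positive:
            counts["fn"] += 1  # missed an instance of the target class
        else:
            counts["tn"] += 1  # correctly predicted the other class

    tp, fp, fn, tn = counts["tp"], counts["fp"], counts["fn"], counts["tn"]
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels (invented for illustration, not real data).
y_true = ["refugee"] * 4 + ["artillery"] * 4
y_pred = ["refugee", "refugee", "refugee", "artillery",
          "artillery", "artillery", "artillery", "refugee"]

metrics = binary_metrics(y_true, y_pred, positive="refugee")
print(metrics)
```

Calling the same function with `positive="artillery"` gives the per-class figures for the other event type, which is how a single accuracy value can accompany two different sets of precision, recall, and F1 values.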