Paper title
Correcting misclassification errors in crowdsourced ecological data: A Bayesian perspective
Paper authors
Paper abstract
Many research domains use data elicited from "citizen scientists" when a direct measure of a process is expensive or infeasible. However, participants may report incorrect estimates or classifications due to their lack of skill. We demonstrate how Bayesian hierarchical models can be used to learn about latent variables of interest while accounting for the participants' abilities. The model is described in the context of an ecological application that involves crowdsourced classifications of georeferenced coral-reef images from the Great Barrier Reef, Australia. The latent variable of interest is the proportion of coral cover, which is a common indicator of coral reef health. The participants' abilities are expressed in terms of the sensitivity and specificity with which they classify a set of points on the images. The model also incorporates a spatial component, which allows prediction of the latent variable in locations that have not been surveyed. We show that the model outperforms traditional weighted-regression approaches used to account for uncertainty in citizen science data. Our approach produces more accurate regression coefficients and provides a better characterization of the latent process of interest. This new method is implemented in the probabilistic programming language Stan and can be applied to a wide range of problems that rely on uncertain citizen science data.
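To illustrate why sensitivity and specificity matter here, the following is a minimal sketch in Python, not the paper's hierarchical Stan model. It assumes a single participant whose sensitivity and specificity are known, and applies the standard misclassification identity: observed proportion = sensitivity × true proportion + (1 − specificity) × (1 − true proportion). The function names, parameter values, and the simulation are hypothetical and chosen only for illustration. The model described in the abstract instead estimates these quantities jointly with the latent coral cover and a spatial component.

```python
import random


def apparent_proportion(true_p, sensitivity, specificity):
    """Expected share of points reported as coral, given classification errors."""
    return sensitivity * true_p + (1.0 - specificity) * (1.0 - true_p)


def corrected_proportion(observed_p, sensitivity, specificity):
    """Invert the identity above to recover the true proportion.

    Assumes sensitivity and specificity are known, which is a simplification:
    the abstract describes them as quantities the model accounts for.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0.0:
        raise ValueError("participant must perform better than random guessing")
    estimate = (observed_p + specificity - 1.0) / denom
    # Sampling noise can push the estimate slightly outside [0, 1].
    return min(1.0, max(0.0, estimate))


def simulate_observed(true_p, sensitivity, specificity, n_points, seed=0):
    """Simulate one participant classifying n_points image points."""
    rng = random.Random(seed)
    reported_coral = 0
    for _ in range(n_points):
        is_coral = rng.random() < true_p
        p_report = sensitivity if is_coral else 1.0 - specificity
        if rng.random() < p_report:
            reported_coral += 1
    return reported_coral / n_points


if __name__ == "__main__":
    # Hypothetical values: 30% true coral cover, an imperfect participant.
    observed = simulate_observed(0.3, 0.8, 0.9, n_points=20000)
    print("observed proportion: ", round(observed, 3))
    print("corrected proportion:", round(corrected_proportion(observed, 0.8, 0.9), 3))
```

The raw observed proportion is biased whenever the participant is imperfect. A Bayesian treatment replaces this point correction with a posterior distribution, so that uncertainty about each participant's ability carries through to the estimate of coral cover.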