Density Based Outlier Scoring on Kepler Data

Paper title

Density Based Outlier Scoring on Kepler Data

Authors

Daniel Giles, Lucianne Walkowicz

Abstract

In the present era of large-scale surveys, big data presents new challenges to the discovery process for anomalous data. Such data can be indicative of systematic errors, extreme (or rare) forms of known phenomena, or, most interestingly, truly novel phenomena which exhibit as-yet unobserved behaviors. In this work we present an outlier scoring methodology to identify and characterize the most promising unusual sources to facilitate discoveries of such anomalous data. We have developed a data mining method based on k-Nearest Neighbor distance in feature space to efficiently identify the most anomalous lightcurves. We test variations of this method, including using principal components of the feature space, removing select features, the effect of the choice of k, and scoring to subset samples. We evaluate the performance of our scoring on known object classes and find that our scoring consistently scores rare (<1000) object classes higher than common classes. We have applied scoring to all long cadence lightcurves of quarters 1 to 17 of Kepler's prime mission and present outlier scores for all 2.8 million lightcurves for the roughly 200k objects.
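
The k-Nearest Neighbor distance scoring mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how the neighbor distances are turned into a score, so the sketch assumes the score is the distance to the k-th nearest neighbor, and it assumes the features are standardized first. The function names, the two-dimensional toy features, and the planted outlier are all hypothetical.

```python
import math
import random


def standardize(features):
    """Scale each feature column to zero mean and unit variance (an assumption)."""
    n = len(features)
    columns = list(zip(*features))
    means = [sum(col) / n for col in columns]
    stds = []
    for col, mean in zip(columns, means):
        var = sum((x - mean) ** 2 for x in col) / n
        stds.append(math.sqrt(var) or 1.0)  # fall back to 1.0 for constant columns
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in features
    ]


def knn_outlier_scores(features, k=3):
    """Score each point by the Euclidean distance to its k-th nearest neighbour.

    Assumed scoring rule: a larger distance means a sparser neighbourhood,
    and therefore a more anomalous point.
    """
    if not 1 <= k < len(features):
        raise ValueError("k must satisfy 1 <= k < number of points")
    scores = []
    for i, point in enumerate(features):
        # Brute-force distances to every other point, sorted ascending.
        dists = sorted(
            math.dist(point, other)
            for j, other in enumerate(features)
            if j != i
        )
        scores.append(dists[k - 1])
    return scores


# Hypothetical toy data: a dense cluster plus one isolated point.
rng = random.Random(0)
toy_features = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(50)]
toy_features.append([8.0, 8.0])  # planted outlier at index 50

toy_scores = knn_outlier_scores(standardize(toy_features), k=3)
most_anomalous = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(most_anomalous)
```

The brute-force search compares every pair of points, so it is only suitable for small samples. Scoring millions of lightcurves would need a spatial index or an approximate neighbor search, which this sketch does not attempt.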
