Paper Title
Detection of Anomalies in a Time Series Data using InfluxDB and Python
Paper Authors
Paper Abstract
Analysis of water and environmental data is an important aspect of many intelligent water and environmental system applications, where inference from such analysis plays a significant role in decision making. Quite often, the data collected through sensors are anomalous for reasons such as system breakdowns and malfunctioning sensor detectors. Regardless of their root causes, such data severely affect the results of subsequent analysis. This paper demonstrates data cleaning and preparation for time-series data and further proposes cost-sensitive machine learning algorithms as a solution for detecting anomalous data points in time-series data. The following models were modified to support cost-sensitive learning, which penalizes misclassified samples and thereby minimizes the total misclassification cost: Logistic Regression, Random Forest, and Support Vector Machines. Our results showed that Random Forest outperformed the other models at predicting the positive class (i.e., anomalies). Applying predictive model improvement techniques such as data oversampling appears to provide little or no improvement to the Random Forest model. Interestingly, recursive feature elimination yielded better model performance while reducing the dimensionality of the data. Finally, the data was ingested with InfluxDB and Kapacitor and streamed to generate new data points, in order to further evaluate model performance on unseen data. This will allow early recognition of undesirable changes in drinking water quality and enable water supply companies to rectify such changes in a timely manner.
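The total misclassification cost mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the cost values or the exact cost formulation used in the paper, so the function name `total_misclassification_cost`, the costs `fn_cost` and `fp_cost`, and the toy labels below are all assumptions chosen for illustration. The sketch only shows why a cost-weighted score can separate two classifiers that plain accuracy treats as equal.

```python
from typing import Sequence


def total_misclassification_cost(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    fn_cost: float = 10.0,  # assumed cost of missing an anomaly (false negative)
    fp_cost: float = 1.0,   # assumed cost of a false alarm (false positive)
) -> float:
    """Sum the cost of every misclassified sample (1 = anomaly, 0 = normal)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    cost = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            cost += fn_cost  # anomaly predicted as normal
        elif truth == 0 and pred == 1:
            cost += fp_cost  # normal point flagged as anomaly
    return cost


# Hypothetical labels: 2 anomalies among 10 readings.
y_true = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]

# Classifier A never flags anything; classifier B catches both anomalies
# but raises two false alarms. Both are correct on 8 of 10 samples.
always_normal = [0] * 10
flags_some = [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]

cost_always_normal = total_misclassification_cost(y_true, always_normal)
cost_flags_some = total_misclassification_cost(y_true, flags_some)

print(cost_always_normal)  # 20.0 (two missed anomalies)
print(cost_flags_some)     # 2.0 (two false alarms)
```

Under these assumed costs, both classifiers have 80% accuracy, yet the one that misses every anomaly is ten times more costly. A cost-sensitive learner is trained to minimize this kind of weighted total instead of the raw error count.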