AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection

Paper title

AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection

Authors

Marius Dragoi, Elena Burceanu, Emanuela Haller, Andrei Manolache, Florin Brad

Abstract

Analyzing the distribution shift of data is a growing research direction in Machine Learning (ML) today, leading to new benchmarks that focus on providing a suitable scenario for studying the generalization properties of ML models. The existing benchmarks focus on supervised learning, and to the best of our knowledge, there is none for unsupervised learning. Therefore, we introduce an unsupervised anomaly detection benchmark with data that shifts over time, built over Kyoto-2006+, a traffic dataset for network intrusion detection. This type of data meets the premise of a shifting input distribution: it covers a large time span ($10$ years), with naturally occurring changes over time (e.g., users modifying their behavior patterns, and software updates). We first highlight the non-stationary nature of the data, using a basic per-feature analysis, t-SNE, and an Optimal Transport approach for measuring the overall distribution distances between years. Next, we propose AnoShift, a protocol splitting the data into IID, NEAR, and FAR testing splits. We validate the performance degradation over time with diverse models, ranging from classical approaches to deep learning. Finally, we show that by acknowledging the distribution shift problem and properly addressing it, the performance can be improved compared to classical training, which assumes independent and identically distributed data (on average, by up to $3\%$ for our approach). Dataset and code are available at https://github.com/bit-ml/AnoShift/.
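As a rough illustration of the IID / NEAR / FAR idea described in the abstract, the sketch below groups time-stamped records into a training split and three test splits by year. The year boundaries, the function names `assign_split` and `split_by_time`, and the `holdout_every` parameter are all assumptions made for this example; the abstract does not state the actual boundaries, and this is not the code from the AnoShift repository.

```python
from collections import defaultdict

# Hypothetical year boundaries, chosen only for illustration;
# they are not stated in the abstract.
TRAIN_YEARS = range(2006, 2011)  # period the model is trained on
NEAR_YEARS = range(2011, 2014)   # years shortly after the training period
FAR_YEARS = range(2014, 2016)    # years furthest from the training period


def assign_split(year, is_held_out=False):
    """Return the split name for a record collected in `year`."""
    if year in TRAIN_YEARS:
        # Held-out records from the training period form the IID test split.
        return "iid_test" if is_held_out else "train"
    if year in NEAR_YEARS:
        return "near_test"
    if year in FAR_YEARS:
        return "far_test"
    raise ValueError(f"year {year} is outside the covered time span")


def split_by_time(records, holdout_every=10):
    """Group (year, features) records into train / IID / NEAR / FAR splits.

    Every `holdout_every`-th record from the training years is held out
    for the IID test split (a simple deterministic stand-in for random sampling).
    """
    splits = defaultdict(list)
    seen_in_train_years = 0
    for year, features in records:
        held_out = False
        if year in TRAIN_YEARS:
            seen_in_train_years += 1
            held_out = seen_in_train_years % holdout_every == 0
        splits[assign_split(year, held_out)].append((year, features))
    return dict(splits)
```

A detector would then be fit on `train` only and scored separately on each test split, so that any drop from `iid_test` to `near_test` to `far_test` can be attributed to the temporal gap.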
