论文标题
编码:编码网络异常检测的NetFlows
ENCODE: Encoding NetFlows for Network Anomaly Detection
论文作者
论文摘要
NetFlow数据是许多网络分析师和研究人员使用的流行网络日志格式。使用NetFlow而不是深度数据包检查的优点是,它更容易收集和处理,并且较少的隐私侵入性。许多作品都使用机器学习来检测NetFlow数据的网络攻击。这些机器学习管道的第一步是在将数据提供给机器学习算法之前对其进行预处理。预处理NetFlow数据存在许多方法;但是,这些只是将现有方法应用于数据,而不是考虑网络数据的特定属性。我们认为,对于源自软件系统(例如NetFlow或软件日志)的数据,频率和特征值上下文的相似性比值本身的相似性更为重要。在这项工作中,我们提出了一种编码算法,该算法在处理数据时直接考虑了特征值的频率和上下文。可以使用此编码来聚集不同类型的网络行为,从而有助于检测网络中的异常。我们使用已编码算法编码的数据来训练多个机器学习模型,以进行异常检测。我们在新的数据集上评估了我们为Kubernetes群集和两个著名的公共NetFlow数据集创建的新数据集的编码有效性。我们从经验上证明,机器学习模型从使用我们的编码进行异常检测中受益。
NetFlow data is a popular network log format used by many network analysts and researchers. The advantages of using NetFlow over deep packet inspection are that it is easier to collect and process, and it is less privacy intrusive. Many works have used machine learning to detect network attacks using NetFlow data. The first step for these machine learning pipelines is to pre-process the data before it is given to the machine learning algorithm. Many approaches exist to pre-process NetFlow data; however, these simply apply existing methods to the data, not considering the specific properties of network data. We argue that for data originating from software systems, such as NetFlow or software logs, similarities in frequency and contexts of feature values are more important than similarities in the value itself. In this work, we propose an encoding algorithm that directly takes the frequency and the context of the feature values into account when the data is being processed. Different types of network behaviours can be clustered using this encoding, thus aiding the process of detecting anomalies within the network. We train several machine learning models for anomaly detection using the data that has been encoded with our encoding algorithm. We evaluate the effectiveness of our encoding on a new dataset that we created for network attacks on Kubernetes clusters and two well-known public NetFlow datasets. We empirically demonstrate that the machine learning models benefit from using our encoding for anomaly detection.