Paper title
Traffic Refinery: Cost-Aware Data Representation for Machine Learning on Network Traffic
Paper authors
Paper abstract
Network management often relies on machine learning to make predictions about performance and security from network traffic. Often, the representation of the traffic is as important as the choice of the model. The features that the model relies on, and the representation of those features, ultimately determine model accuracy, as well as where and whether the model can be deployed in practice. Thus, the design and evaluation of these models ultimately requires understanding not only model accuracy but also the systems costs associated with deploying the model in an operational network. Towards this goal, this paper develops a new framework and system that enables a joint evaluation of both the conventional notions of machine learning performance (e.g., model accuracy) and the systems-level costs of different representations of network traffic. We highlight these two dimensions for two practical network management tasks, video streaming quality inference and malware detection, to demonstrate the importance of exploring different representations to find the appropriate operating point. We demonstrate the benefit of exploring a range of representations of network traffic and present Traffic Refinery, a proof-of-concept implementation that both monitors network traffic at 10 Gbps and transforms traffic in real time to produce a variety of feature representations for machine learning. Traffic Refinery both highlights this design space and makes it possible to explore different representations for learning, balancing systems costs related to feature extraction and model training against model accuracy.
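The joint evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the Traffic Refinery implementation: the `Representation` class, the helper functions, the representation names, and every accuracy and cost figure below are hypothetical placeholders chosen only to show how an operating point might be selected once both dimensions have been measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    """A candidate feature representation (hypothetical, for illustration)."""
    name: str
    accuracy: float  # model accuracy in [0, 1]
    cost: float      # systems cost in arbitrary units (e.g. CPU or memory)


def pareto_front(candidates):
    """Keep candidates that no other candidate beats on both accuracy and cost."""
    front = []
    for c in candidates:
        dominated = any(
            o.accuracy >= c.accuracy
            and o.cost <= c.cost
            and (o.accuracy > c.accuracy or o.cost < c.cost)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda r: r.cost)


def best_within_budget(candidates, budget):
    """Return the most accurate candidate whose cost fits the budget, or None."""
    affordable = [c for c in candidates if c.cost <= budget]
    return max(affordable, key=lambda r: r.accuracy, default=None)


# Placeholder numbers, not results from the paper.
candidates = [
    Representation("packet_counters", 0.80, 1.0),
    Representation("flow_statistics", 0.90, 3.0),
    Representation("per_packet_timing", 0.88, 6.0),
    Representation("full_payload", 0.93, 10.0),
]

for rep in pareto_front(candidates):
    print(f"{rep.name}: accuracy={rep.accuracy:.2f}, cost={rep.cost:.1f}")
```

Under these assumed values, a representation that costs more yet is less accurate than an alternative drops out of the front, and `best_within_budget(candidates, 5.0)` picks the most accurate representation that a cost budget of 5.0 can afford.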