Paper Title

Scalable Tail Latency Estimation for Data Center Networks

Paper Authors

Kevin Zhao, Prateesh Goyal, Mohammad Alizadeh, Thomas E. Anderson

Paper Abstract

In this paper, we consider how to provide fast estimates of flow-level tail latency performance for very large-scale data center networks. Network tail latency is often a crucial metric for cloud application performance that can be affected by a wide variety of factors, including network load, inter-rack traffic skew, traffic burstiness, flow size distributions, oversubscription, and topology asymmetry. Network simulators such as ns-3 and OMNeT++ can provide accurate answers, but are very hard to parallelize, taking hours or days to answer what-if questions for a single configuration at even moderate scale. Recent work with MimicNet has shown how to use machine learning to improve simulation performance, but at the cost of a long training step per configuration, and with assumptions about workload and topology uniformity that typically do not hold in practice. We address this gap by developing a set of techniques to provide fast performance estimates for large-scale networks with general traffic matrices and topologies. A key step is to decompose the problem into a large number of parallel, independent single-link simulations; we carefully combine these link-level simulations to produce accurate estimates of end-to-end flow-level performance distributions for the entire network. Like MimicNet, we exploit symmetry where possible to gain additional speedups, but without relying on machine learning, so there is no training delay. On large-scale networks where ns-3 takes 11 to 27 hours to simulate five seconds of network behavior, our techniques run in one to two minutes with 99th-percentile accuracy within 9% for flow completion times.
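The decomposition idea in the abstract can be illustrated with a toy sketch. Everything below is hypothetical and much simpler than the paper's actual method: we assume each link's delay distribution has already been produced by an independent single-link simulation, and we combine links along a flow's path by summing one independent draw per link, an independence assumption the paper's combination step handles more carefully. Link names, distributions, and sample counts are made up for illustration.

```python
import random

random.seed(0)

# Hypothetical per-link delay samples (microseconds), standing in for the
# output of independent single-link simulations. A real model would drive
# each link with the traffic actually crossing it.
link_delay_samples = {
    "tor1-agg1": [random.expovariate(1 / 10.0) for _ in range(10_000)],
    "agg1-core": [random.expovariate(1 / 20.0) for _ in range(10_000)],
    "core-agg2": [random.expovariate(1 / 20.0) for _ in range(10_000)],
    "agg2-tor2": [random.expovariate(1 / 10.0) for _ in range(10_000)],
}

def end_to_end_delay_samples(path, n=10_000):
    """Approximate a flow's end-to-end delay distribution by summing one
    independent draw from each link's sample set along the path (a strong
    independence assumption, used here only for illustration)."""
    return [
        sum(random.choice(link_delay_samples[link]) for link in path)
        for _ in range(n)
    ]

def percentile(samples, p):
    """Return (approximately) the p-th percentile of a list of samples."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[idx]

# Tail latency estimate for one hypothetical four-hop path.
path = ["tor1-agg1", "agg1-core", "core-agg2", "agg2-tor2"]
samples = end_to_end_delay_samples(path)
print(f"p99 end-to-end delay: {percentile(samples, 99):.1f} us")
```

Because each link's samples are generated independently, the per-link simulations parallelize trivially; the hard part, which this sketch glosses over, is combining them without losing the correlations that shape the tail.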
