Paper Title
Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
Paper Authors
Paper Abstract
Machine learning inference is becoming a core building block for interactive web applications. As a result, the underlying model serving systems on which these applications depend must consistently meet low latency targets. Existing model serving architectures use well-known reactive techniques to alleviate common-case sources of latency, but cannot effectively curtail tail latency caused by unpredictable execution times. Yet the underlying execution times are not fundamentally unpredictable; on the contrary, we observe that inference using Deep Neural Network (DNN) models has deterministic performance. Here, starting with the predictable execution times of individual DNN inferences, we adopt a principled design methodology to successively build a fully distributed model serving system that achieves predictable end-to-end performance. We evaluate our implementation, Clockwork, using production trace workloads, and show that Clockwork can support thousands of models while simultaneously meeting 100ms latency targets for 99.9999% of requests. We further demonstrate that Clockwork exploits predictable execution times to achieve tight request-level service-level objectives (SLOs) as well as a high degree of request-level performance isolation.
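To make the abstract's central claim concrete, the sketch below illustrates how deterministic per-model execution times enable tight SLO enforcement: if a scheduler knows exactly how long each inference takes, it can decide at queue time whether a request can still meet its deadline and reject doomed requests early. This is a minimal illustrative sketch, not Clockwork's actual scheduler; the names (`Request`, `schedule`, `exec_time_ms`, `SLO_MS`) and the 60ms figure are hypothetical.

```python
from collections import deque

SLO_MS = 100.0  # end-to-end latency target, per the abstract (hypothetical constant name)

class Request:
    """A single inference request (hypothetical structure, for illustration)."""
    def __init__(self, model, arrival_ms):
        self.model = model
        self.arrival_ms = arrival_ms

def schedule(queue, exec_time_ms, now_ms):
    """Admit a request only if its (deterministic) execution time still fits
    within its SLO deadline; reject the rest proactively instead of letting
    them miss the deadline and delay requests queued behind them."""
    admitted, rejected = [], []
    t = now_ms
    while queue:
        req = queue.popleft()
        deadline = req.arrival_ms + SLO_MS
        if t + exec_time_ms[req.model] <= deadline:
            admitted.append(req)
            t += exec_time_ms[req.model]  # execution time is deterministic, so no variance margin is needed
        else:
            rejected.append(req)
    return admitted, rejected

# Example: two requests for the same model queued at t=0 and t=1, each taking 60 ms.
q = deque([Request("resnet50", 0.0), Request("resnet50", 1.0)])
ok, dropped = schedule(q, {"resnet50": 60.0}, now_ms=5.0)
print(len(ok), len(dropped))  # 1 1: the second request would finish at 125 ms, past its 101 ms deadline
```

With unpredictable execution times, this kind of early, exact admission decision is impossible; predictability is what lets such a scheduler bound tail latency rather than merely react to it.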