Paper Title
Orloj: Predictably Serving Unpredictable DNNs
Paper Authors
Paper Abstract
Existing DNN serving solutions can provide tight latency SLOs while maintaining high throughput via careful scheduling of incoming requests, whose execution times are assumed to be highly predictable and data-independent. However, inference requests to emerging dynamic DNNs -- e.g., popular natural language processing (NLP) models and computer vision (CV) models that skip layers -- are data-dependent. They exhibit poor performance when served using existing solutions because they experience large variance in request execution times depending on the input -- the longest request in a batch inflates the execution times of the smaller ones, causing SLO misses in the absence of careful batching. In this paper, we present Orloj, a dynamic DNN serving system that captures this variance in dynamic DNNs using empirical distributions of expected request execution times, and then efficiently batches and schedules requests without knowing a request's precise execution time. Orloj significantly outperforms state-of-the-art serving solutions on high-variance dynamic DNN workloads, improving finish rate by 51--80% under tight SLO constraints and by over 100% under more relaxed SLO settings. For well-studied static DNN workloads, Orloj maintains performance comparable to the state of the art.
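The core idea in the abstract -- that a batch runs as long as its slowest request, so batching decisions must account for the tail of the execution-time distribution rather than a point estimate -- can be sketched as follows. This is an illustrative toy, not Orloj's actual algorithm: the quantile threshold, the Monte Carlo estimator, and the `profile` of per-request latencies are all hypothetical assumptions for the example.

```python
import random

def batch_time_quantile(samples, batch_size, q=0.95, trials=2000, rng=None):
    """Estimate the q-quantile of a batch's execution time by simulating
    the max of `batch_size` draws from an empirical latency distribution.
    The max matters because the slowest request dominates the batch."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch deterministic
    maxima = sorted(
        max(rng.choice(samples) for _ in range(batch_size))
        for _ in range(trials)
    )
    return maxima[int(q * (trials - 1))]

def fits_slo(samples, batch_size, deadline_ms, q=0.95):
    """Admit this batch size only if, at confidence q, the batch is
    still expected to finish before the tightest deadline in it."""
    return batch_time_quantile(samples, batch_size, q) <= deadline_ms

# Hypothetical latency profile: 98% of requests take 10 ms, 2% take 40 ms
profile = [10] * 98 + [40] * 2

print(fits_slo(profile, batch_size=1, deadline_ms=30))  # a lone request is very likely fast
print(fits_slo(profile, batch_size=8, deadline_ms=30))  # a large batch likely contains a slow request
```

The sketch shows why a mean-based estimate fails for dynamic DNNs: the average latency here (~10.6 ms) suggests any batch fits a 30 ms SLO, yet a batch of 8 has roughly a 15% chance of containing a 40 ms request, so the 95th-percentile batch time blows past the deadline.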