Paper Title
Dynamic Network Adaptation at Inference
Paper Authors
Paper Abstract
Machine learning (ML) inference is a real-time workload that must comply with strict Service Level Objectives (SLOs), including latency and accuracy targets. Unfortunately, ensuring that SLOs are not violated in inference-serving systems is challenging due to inherent model accuracy-latency tradeoffs, SLO diversity across and within application domains, evolution of SLOs over time, unpredictable query patterns, and co-location interference. In this paper, we observe that neural networks exhibit high degrees of per-input activation sparsity during inference. Thus, we propose SLO-Aware Neural Networks, which dynamically drop out nodes per inference query, thereby tuning the amount of computation performed according to specified SLO optimization targets and machine utilization. SLO-Aware Neural Networks achieve average speedups of $1.3-56.7\times$ with little to no accuracy loss (less than 0.3%). When accuracy-constrained, SLO-Aware Neural Networks are able to serve a range of accuracy targets at low latency with the same trained model. When latency-constrained, SLO-Aware Neural Networks can proactively alleviate latency degradation from co-location interference while maintaining high accuracy, so that latency constraints are met.
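The core mechanism the abstract describes — exploiting per-input activation sparsity by dropping low-activation nodes per query — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the two-layer network, the random weights, and the `keep_frac` knob (which in the real system would be set per query from the SLO target and current machine load) are all illustrative assumptions. The key property shown is that skipping hidden nodes with small post-ReLU activations lets the second layer's matrix product shrink proportionally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer MLP; weights are random stand-ins,
# not trained parameters from the paper.
W1 = rng.normal(size=(64, 256))   # input -> hidden
W2 = rng.normal(size=(256, 10))   # hidden -> output

def forward(x, keep_frac=1.0):
    """Forward pass exploiting per-input activation sparsity.

    After the ReLU, only the `keep_frac` fraction of hidden nodes
    with the largest activations are propagated; the matrix product
    with W2 then touches only the corresponding rows, reducing the
    computation performed for this query.
    """
    h = np.maximum(x @ W1, 0.0)              # ReLU hidden activations
    k = max(1, int(keep_frac * h.size))      # compute budget for this query
    if k < h.size:
        idx = np.argpartition(h, -k)[-k:]    # indices of the top-k activations
        return h[idx] @ W2[idx, :]           # only k rows of W2 are used
    return h @ W2
```

Because ReLU already zeroes many activations for a given input, any budget `k` at least as large as the number of nonzero activations reproduces the full-network output exactly; smaller budgets trade accuracy for latency, which is the dial the paper's SLO-aware serving policy turns.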