Paper Title
Dynamic Network Adaptation at Inference
Paper Authors
Paper Abstract
Machine learning (ML) inference is a real-time workload that must comply with strict Service Level Objectives (SLOs), including latency and accuracy targets. Unfortunately, ensuring that SLOs are not violated in inference-serving systems is challenging due to inherent model accuracy-latency tradeoffs, SLO diversity across and within application domains, evolution of SLOs over time, unpredictable query patterns, and co-location interference. In this paper, we observe that neural networks exhibit high degrees of per-input activation sparsity during inference. Thus, we propose SLO-Aware Neural Networks, which dynamically drop out nodes per inference query, thereby tuning the amount of computation performed according to specified SLO optimization targets and machine utilization. SLO-Aware Neural Networks achieve average speedups of $1.3-56.7\times$ with little to no accuracy loss (less than 0.3%). When accuracy-constrained, SLO-Aware Neural Networks are able to serve a range of accuracy targets at low latency with the same trained model. When latency-constrained, SLO-Aware Neural Networks can proactively alleviate latency degradation from co-location interference while maintaining high accuracy, so that latency constraints are met.
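The core mechanism the abstract describes — exploiting per-input activation sparsity by dropping low-activation nodes per query — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the two-layer network, the random weights, and the `keep_frac` knob (which in the real system would be set per query from the SLO target and current machine load) are all illustrative assumptions. The key property shown is that skipping hidden nodes with small post-ReLU activations lets the second layer's matrix product shrink proportionally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer MLP; weights are random stand-ins,
# not trained parameters from the paper.
W1 = rng.normal(size=(64, 256))   # input -> hidden
W2 = rng.normal(size=(256, 10))   # hidden -> output

def forward(x, keep_frac=1.0):
    """Forward pass exploiting per-input activation sparsity.

    After the ReLU, only the `keep_frac` fraction of hidden nodes
    with the largest activations are propagated; the matrix product
    with W2 then touches only the corresponding rows, reducing the
    computation performed for this query.
    """
    h = np.maximum(x @ W1, 0.0)              # ReLU hidden activations
    k = max(1, int(keep_frac * h.size))      # compute budget for this query
    if k < h.size:
        idx = np.argpartition(h, -k)[-k:]    # indices of the top-k activations
        return h[idx] @ W2[idx, :]           # only k rows of W2 are used
    return h @ W2
```

Because ReLU already zeroes many activations for a given input, any budget `k` at least as large as the number of nonzero activations reproduces the full-network output exactly; smaller budgets trade accuracy for latency, which is the dial the paper's SLO-aware serving policy turns.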