Paper Title

Accelerating Deep Learning Inference via Freezing

Authors

Adarsh Kumar, Arjun Balasubramanian, Shivaram Venkataraman, Aditya Akella

Abstract

Over the last few years, Deep Neural Networks (DNNs) have become ubiquitous owing to their high accuracy on real-world tasks. However, this increase in accuracy comes at the cost of computationally expensive models leading to higher prediction latencies. Prior efforts to reduce this latency such as quantization, model distillation, and any-time prediction models typically trade off accuracy for performance. In this work, we observe that caching intermediate layer outputs can help us avoid running all the layers of a DNN for a sizeable fraction of inference requests. We find that this can potentially reduce the number of effective layers by half for 91.58% of CIFAR-10 requests run on ResNet-18. We present Freeze Inference, a system that introduces approximate caching at each intermediate layer, and we discuss techniques to reduce the cache size and improve the cache hit rate. Finally, we discuss some of the open research challenges in realizing such a design.
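The mechanism the abstract describes (probing an approximate cache at every intermediate layer, exiting early on a hit, and filling the caches after a full run) can be sketched roughly as follows. This is a minimal, hypothetical Python sketch: the ApproximateLayerCache class, the cosine-similarity matching, and the fixed hit threshold are assumptions made for illustration, not the paper's actual Freeze Inference implementation or its cache-size and hit-rate techniques.

import numpy as np

# Illustrative sketch only: the class name, cosine-similarity matching, and the
# hit threshold are assumptions, not the paper's Freeze Inference design.

class ApproximateLayerCache:
    """A per-layer cache of intermediate activations and their final predictions."""

    def __init__(self, threshold=0.95):
        self.keys = []              # previously seen activations at this layer
        self.labels = []            # final prediction associated with each key
        self.threshold = threshold  # similarity needed to count as a "hit"

    def lookup(self, activation):
        """Return the cached label of the closest key if it is similar enough."""
        if not self.keys:
            return None
        q = activation.ravel()
        keys = np.stack([k.ravel() for k in self.keys])
        sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q) + 1e-8)
        best = int(np.argmax(sims))
        return self.labels[best] if sims[best] >= self.threshold else None

    def insert(self, activation, label):
        self.keys.append(activation.copy())
        self.labels.append(label)


def cached_inference(x, layers, caches, classifier):
    """Run layers one at a time, probing each layer's cache; on a hit, exit early."""
    h, intermediates = x, []
    for layer, cache in zip(layers, caches):
        h = layer(h)
        hit = cache.lookup(h)
        if hit is not None:
            return hit              # remaining layers are skipped
        intermediates.append(h)
    label = classifier(h)           # miss at every layer: full inference
    for cache, act in zip(caches, intermediates):
        cache.insert(act, label)    # warm the caches for future requests
    return label


# Toy usage with two dense ReLU "layers" standing in for ResNet blocks.
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((16, 8)), rng.standard_normal((8, 8))
layers = [lambda h: np.maximum(h @ W1, 0), lambda h: np.maximum(h @ W2, 0)]
caches = [ApproximateLayerCache() for _ in layers]
classifier = lambda h: int(np.argmax(h))

x = rng.standard_normal(16)
print(cached_inference(x, layers, caches, classifier))         # cold cache: full run
print(cached_inference(x + 0.01, layers, caches, classifier))  # near-duplicate: early exit

On the second request, a near-duplicate of the first, the first layer's cache already holds a sufficiently similar activation, so the remaining layers are skipped. This is the effect the abstract quantifies as roughly halving the effective layer count for 91.58% of CIFAR-10 requests on ResNet-18.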
