Paper Title

PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives

Paper Authors

Sanket Tavarageri, Alexander Heinecke, Sasikanth Avancha, Gagandeep Goyal, Ramakrishna Upadrasta, Bharat Kaul

Abstract

Deep Neural Networks (DNNs) have revolutionized many aspects of our lives. The use of DNNs is becoming ubiquitous, including in software for image recognition, speech recognition, speech synthesis, and language translation, to name a few. The training of DNN architectures, however, is computationally expensive. Once a model is created, its use in the intended application, the inference task, is computationally heavy as well, and inference needs to be fast for real-time use. To obtain high performance today, the norm is to rely on code for Deep Learning (DL) primitives that has been optimized for specific architectures by expert programmers and exposed via libraries. However, given the constant emergence of new DNN architectures, creating hand-optimized code is expensive, slow, and not scalable. To address this performance-productivity challenge, in this paper we present compiler algorithms that automatically generate high-performance implementations of DL primitives closely matching the performance of hand-optimized libraries. We develop novel data reuse analysis algorithms using the polyhedral model to derive efficient execution schedules automatically. In addition, because most DL primitives use some variant of matrix multiplication at their core, we develop a flexible framework in which library implementations of matrix multiplication can be plugged in in place of a subset of the loops. We show that such a hybrid compiler plus minimal-library-use approach achieves state-of-the-art performance. We also develop compiler algorithms that perform operator fusion to reduce data movement through the memory hierarchy of the computer system.
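To make the hybrid "compiler-generated loops plus library microkernel" idea in the abstract concrete, below is a minimal C sketch assuming a 1x1-convolution-like primitive: the outer loop is the part a compiler such as PolyDL would generate and schedule for data reuse, while the innermost matrix-multiplication loops are delegated to a small-GEMM routine. The small_gemm function and the tensor sizes here are hypothetical placeholders, not the paper's generated code or any particular library's API (a JIT-compiled small-GEMM kernel from a vendor library would be a real-world stand-in).

```c
/* Illustrative sketch of the hybrid approach: compiler-scheduled outer
 * loops with the inner matrix multiplication handed to a microkernel.
 * All names and sizes are hypothetical. */
#include <stdio.h>

enum { MB = 4, PIX = 8, OFM = 16, IFM = 16 };  /* tiny example sizes */

/* Placeholder microkernel: out[PIX][OFM] += in[PIX][IFM] * w[IFM][OFM].
 * In the hybrid scheme this body would be a call into a hand-optimized
 * library small-GEMM instead of plain C loops. */
static void small_gemm(float in[PIX][IFM],
                       float w[IFM][OFM],
                       float out[PIX][OFM]) {
    for (int p = 0; p < PIX; ++p)
        for (int c = 0; c < IFM; ++c)
            for (int k = 0; k < OFM; ++k)
                out[p][k] += in[p][c] * w[c][k];
}

int main(void) {
    /* static arrays are zero-initialized, so accumulation starts at 0 */
    static float input[MB][PIX][IFM], weight[IFM][OFM], output[MB][PIX][OFM];

    /* Simple initialization so the result is easy to check. */
    for (int n = 0; n < MB; ++n)
        for (int p = 0; p < PIX; ++p)
            for (int c = 0; c < IFM; ++c)
                input[n][p][c] = 1.0f;
    for (int c = 0; c < IFM; ++c)
        for (int k = 0; k < OFM; ++k)
            weight[c][k] = 0.5f;

    /* Outer loop: the part the compiler generates and reorders/tiles
     * based on its data reuse analysis; each iteration invokes the
     * library microkernel on a small tile. */
    for (int n = 0; n < MB; ++n)
        small_gemm(input[n], weight, output[n]);

    printf("output[0][0][0] = %.1f (expected %.1f)\n",
           output[0][0][0], (float)IFM * 0.5f);
    return 0;
}
```

In the system described by the paper, the compiler would additionally tile and reorder the outer loops, guided by its polyhedral data reuse analysis, so that the operands of each microkernel call stay resident in the cache hierarchy.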
