Paper Title
Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction
Paper Authors
Paper Abstract
Writing high-performance code requires significant expertise in the programming language, compiler optimizations, and hardware. This often leads to poor productivity and portability, and is inconvenient for non-programmer domain specialists such as physicists. More desirable is a high-level language in which the domain specialist simply specifies the workload in terms of high-level operations (e.g., matrix-multiply(A, B)), and the compiler identifies the best implementation, fully utilizing the heterogeneous platform. To create a compiler that supports productivity, portability, and performance simultaneously, it is crucial to predict the performance of the various available implementations (variants) of the dominant operations (kernels) in a workload on various hardware, in order to decide (a) which variant should be chosen for each kernel in the workload, and (b) on which hardware resource that variant should run. To enable this performance prediction, we propose lightweight augmented neural networks for arbitrary combinations of kernel, variant, and hardware. A key innovation is using the mathematical complexity of a kernel as a feature to achieve higher accuracy. The models are compact, which reduces training time and enables fast inference at compile time and run time. Using models with fewer than 75 parameters and only 250 training data instances, we obtain a low MAPE of 3%, significantly outperforming traditional feed-forward neural networks on 48 kernel-variant-hardware combinations. We further demonstrate that our variant-selection approach can be used in Halide implementations to obtain up to a 1.7x speedup over Halide's auto-scheduler.
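To make the augmentation idea concrete, below is a minimal sketch in Python of the kind of model the abstract describes: kernel size parameters are augmented with a hand-derived mathematical-complexity feature (here, the FLOP count of a matrix multiply), and a tiny feed-forward network predicts runtime, evaluated by MAPE. This is not the authors' code; the layer sizes, the FLOP formula, and all names are illustrative assumptions.

```python
# Minimal sketch of a "lightweight augmented neural network" for kernel
# runtime prediction. Not the authors' implementation: layer sizes, the
# complexity formula, and all names are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn

def augmented_features(M, N, K):
    # Raw kernel parameters augmented with the kernel's mathematical
    # complexity -- here, the FLOP count of C = A(MxK) @ B(KxN).
    flops = 2.0 * M * N * K
    return np.log([M, N, K, flops])  # log scale keeps magnitudes comparable

# Compact model: (4*8 + 8) + (8*1 + 1) = 49 parameters, consistent with the
# "fewer than 75 parameters" regime quoted in the abstract.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

def mape(pred, true):
    # Mean Absolute Percentage Error, the accuracy metric quoted (~3%).
    return float(torch.mean(torch.abs((pred - true) / true)) * 100)

# Example inference for a 512x512x512 matrix multiply (untrained model,
# so the output is meaningless until fit on measured runtimes).
x = torch.tensor(augmented_features(512, 512, 512), dtype=torch.float32)
predicted_runtime = model(x)
```

Because the complexity feature already encodes most of the runtime's scaling behavior, a network this small can plausibly reach high accuracy from only a few hundred training instances, which is the design point the abstract emphasizes.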