Extending High-Level Synthesis for Task-Parallel Programs

Paper Title

Extending High-Level Synthesis for Task-Parallel Programs

Authors

Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong

Abstract

C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycles compared with the traditional register-transfer level design approach. Yet, limited by sequential C semantics, it remains challenging to adopt the same highly productive high-level programming approach in many other application domains, where coarse-grained tasks run in parallel and communicate with each other at a fine-grained level. While current HLS tools do support task-parallel programs, productivity is greatly limited (1) in the code development cycle due to poor programmability, (2) in the correctness verification cycle due to restricted software simulation, and (3) in the QoR tuning cycle due to slow code generation. Such limited productivity often defeats the purpose of HLS and hinders programmers from adopting HLS for task-parallel FPGA accelerators. In this paper, we extend the HLS C++ language and present a fully automated framework with programmer-friendly interfaces, unconstrained software simulation, and fast hierarchical code generation to overcome these limitations and demonstrate how task-parallel programs can be productively supported in HLS. Experimental results based on a wide range of real-world task-parallel programs show that, on average, the lines of kernel and host code are reduced by 22% and 51%, respectively, which considerably improves programmability. The correctness verification and iterative QoR tuning cycles are shortened by 3.2x and 6.8x, respectively. Our work is open-source at https://github.com/UCLA-VAST/tapa/.
