FADO: Floorplan-Aware Directive Optimization for High-Level Synthesis Designs on Multi-Die FPGAs

Paper Title

FADO: Floorplan-Aware Directive Optimization for High-Level Synthesis Designs on Multi-Die FPGAs

Authors

Linfeng Du, Tingyuan Liang, Sharad Sinha, Zhiyao Xie, Wei Zhang

Abstract

Multi-die FPGAs are widely adopted to deploy large hardware accelerators. Two factors impede the performance optimization of HLS designs implemented on multi-die FPGAs. On the one hand, the long delay of nets crossing die boundaries makes floorplanning and pipelining an application an NP-hard problem. On the other hand, traditional automated search flows for HLS directive optimization target single-die FPGAs, and hence cannot account for the resource constraints on each die or the timing issues incurred by die crossings. Further, because of the large design scale, legalizing the floorplan of the HLS design generated under each group of configurations during directive optimization takes an excessively long time. To co-optimize the directives and floorplan of HLS designs on multi-die FPGAs, we propose the FADO framework, which formulates the directive-floorplan co-search problem as multi-choice multi-dimensional bin-packing and solves it with an iterative optimization flow. At each step of the directive search, a latency-bottleneck-guided greedy algorithm searches for more efficient directive configurations. For floorplanning, instead of repeatedly invoking global floorplanning algorithms, we implement a more efficient incremental floorplan legalization algorithm. It mainly applies the worst-fit online bin-packing algorithm to balance the floorplan, together with an offline best-fit-decreasing re-packing step to compact the floorplan, followed by pipelining of the long wires that cross die boundaries. In experiments on HLS designs mixing dataflow and non-dataflow kernels, FADO not only automates the co-optimization and finishes with a 693X to 4925X shorter runtime than DSE assisted by global floorplanning, but also yields a 1.16X to 8.78X improvement in overall workflow execution time after implementation on the Xilinx Alveo U250 FPGA.
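
To make the two packing heuristics named in the abstract concrete, the following is a minimal sketch, not the FADO implementation. It assumes a single resource dimension per die (FADO's formulation is multi-choice and multi-dimensional), equal die capacities, and hypothetical function names `worst_fit_online` and `best_fit_decreasing`. Worst-fit places each arriving function on the die with the most free resources, which tends to balance the floorplan; best-fit-decreasing sorts all functions by size and places each on the tightest die that still fits, which tends to compact it.

```python
from typing import List, Optional


def worst_fit_online(sizes: List[int], capacity: int, num_dies: int) -> List[Optional[int]]:
    """Place items in arrival order on the die with the MOST free space (balancing)."""
    free = [capacity] * num_dies
    placement: List[Optional[int]] = []
    for size in sizes:
        # max() returns the first die among ties, so placement is deterministic.
        target = max(range(num_dies), key=lambda d: free[d])
        if free[target] < size:
            placement.append(None)  # does not fit on any die
            continue
        free[target] -= size
        placement.append(target)
    return placement


def best_fit_decreasing(sizes: List[int], capacity: int, num_dies: int) -> List[Optional[int]]:
    """Sort items largest first, then place each on the die with the LEAST free space that fits (compacting)."""
    free = [capacity] * num_dies
    placement: List[Optional[int]] = [None] * len(sizes)
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    for idx in order:
        candidates = [d for d in range(num_dies) if free[d] >= sizes[idx]]
        if not candidates:
            continue  # leave this item unplaced
        target = min(candidates, key=lambda d: free[d])
        free[target] -= sizes[idx]
        placement[idx] = target
    return placement


# Illustrative values only: four functions, four dies, capacity 10 each.
balanced = worst_fit_online([4, 3, 3, 2], capacity=10, num_dies=4)
compact = best_fit_decreasing([4, 3, 3, 2], capacity=10, num_dies=4)
```

On these illustrative inputs, worst-fit spreads the four functions over all four dies, while best-fit-decreasing packs the first three onto one die. In the sketch this contrast stands in for the balance-versus-compaction trade-off the abstract describes; the resource figures are made up and do not correspond to any device.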
