Paper Title


Using PHAST to port Caffe library: First experiences and lessons learned

Paper Authors

Eduardo José Gómez-Hernández, Pablo Antonio Martínez, Biagio Peccerillo, Sandro Bartolini, José Manuel García, Gregorio Bernabé

Paper Abstract


Performance has always been a hot topic in computing. However, the viable ways to achieve it have taken many forms at different moments of computing history. Today, technological limits have pushed the adoption of increasingly parallel multi-core and many-core architectures, and even the use of highly specific hardware (aka Domain-Specific Architectures, or DSAs) to solve very specific problems. In this new context, one major problem is how to develop software once and be able to run it seamlessly on multiple accelerator architectures. Ideally, the goal is a single programming model that can automatically target the code to different kinds of parallel architectures, allowing specific tuning with minimal, if any, changes to the source code, in order to achieve performance portability. A comprehensive solution to this is still lacking. In this work, we present the use of the PHAST Library, which allows users to code once, at a high level of abstraction and thus with high productivity, and to automatically target different parallel devices by changing the compilation process. As a case study, we have worked on porting the well-known deep-learning Caffe framework. The framework has been split into different parts, and some of them have been ported, obtaining a working, straightforward implementation that can run on both CPUs and GPUs. We conclude by discussing the lessons learned during the porting process, and by analyzing the obtained performance with a view to completing the port and extending it in future work.
