Paper Title

Dataflow Aware Mapping of Convolutional Neural Networks Onto Many-Core Platforms With Network-on-Chip Interconnect

Paper Authors

Andreas Bytyn, René Ahlsdorf, Rainer Leupers, Gerd Ascheid

Abstract

Machine intelligence, especially using convolutional neural networks (CNNs), has become a large area of research over the past years. Increasingly sophisticated hardware accelerators have been proposed that exploit, e.g., sparsity in computations and make use of reduced-precision arithmetic to scale down the energy consumption. However, future platforms require more than just energy efficiency: scalability is becoming an increasingly important factor. The effort required for physical implementation grows with the size of the accelerator, making it more difficult to meet target constraints. Using many-core platforms consisting of several homogeneous cores can alleviate the aforementioned limitations with regard to physical implementation, at the expense of an increased dataflow mapping effort. While the dataflow in CNNs is deterministic and can therefore be optimized offline, finding a suitable scheme that minimizes both runtime and off-chip memory accesses is a challenging task, which becomes even more complex if an interconnect system is involved. This work presents an automated mapping strategy starting at the single-core level with different optimization targets for minimal runtime and minimal off-chip memory accesses. The strategy is then extended towards a suitable many-core mapping scheme and evaluated using a scalable system-level simulation with a network-on-chip interconnect. Design space exploration is performed by mapping the well-known CNNs AlexNet and VGG-16 to platforms of different core counts and computational power per core in order to investigate the trade-offs. Our mapping strategy and system setup are scaled starting from the single-core level up to 128 cores, thereby showing the limits of the selected approach.
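The abstract points out that CNN dataflow is deterministic, so mapping decisions can be searched offline against two competing objectives (runtime vs. off-chip memory traffic). As a rough illustration of what such an offline search can look like, the sketch below enumerates channel tilings for a single convolutional layer under a simplified cost model and picks the candidate that minimizes the chosen objective. All parameter names, tile sizes, and cost formulas here are illustrative assumptions, not the paper's actual mapping strategy or cost model.

```python
# Illustrative offline design-space sketch (assumptions, not the paper's algorithm):
# enumerate channel tilings of one conv layer and score each candidate by an
# estimated runtime and estimated off-chip memory traffic.

from itertools import product

def conv_costs(C_in, C_out, H, W, K, tile_c, tile_k,
               macs_per_cycle=256, on_chip_bytes=512 * 1024):
    """Rough cost model for one conv layer under a (tile_c, tile_k) channel tiling."""
    macs = C_in * C_out * H * W * K * K
    runtime_cycles = macs / macs_per_cycle  # compute-bound estimate

    # Working set of one tile: input slab + weight slab + output slab (2 bytes/element).
    tile_bytes = 2 * (tile_c * H * W + tile_c * tile_k * K * K + tile_k * H * W)
    if tile_bytes > on_chip_bytes:
        return None  # this tiling does not fit in on-chip memory

    n_c_tiles = -(-C_in // tile_c)   # ceil division
    n_k_tiles = -(-C_out // tile_k)
    # Simplified traffic model: inputs re-read once per output-channel tile,
    # weights read once, partial output sums spilled once per input-channel tile.
    offchip_bytes = 2 * (n_k_tiles * C_in * H * W
                         + C_in * C_out * K * K
                         + n_c_tiles * C_out * H * W)
    return runtime_cycles, offchip_bytes

def best_tiling(C_in, C_out, H, W, K, objective="offchip"):
    """Return the feasible tiling minimizing the chosen objective."""
    candidates = []
    for tile_c, tile_k in product([8, 16, 32, 64], repeat=2):
        costs = conv_costs(C_in, C_out, H, W, K, tile_c, tile_k)
        if costs is not None:
            runtime, offchip = costs
            key = offchip if objective == "offchip" else runtime
            candidates.append((key, (tile_c, tile_k), runtime, offchip))
    return min(candidates) if candidates else None

# Example: a VGG-16-like layer (64 -> 128 channels, 112x112 feature map, 3x3 kernels).
print(best_tiling(64, 128, 112, 112, 3, objective="offchip"))
```

In the paper's setting this per-layer search is only the single-core starting point; the many-core extension additionally has to account for distributing work across cores and for the network-on-chip interconnect, which the abstract notes makes the problem considerably more complex.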
