Paper Title

Accelerating CNN inference on long vector architectures via co-design

Paper Authors

Sonia Rani Gupta, Nikela Papadopoulou, Miquel Pericas

Paper Abstract

CPU-based inference can be an alternative to off-chip accelerators, and vector architectures are a promising option due to their efficiency. However, the large design space of convolutional algorithms and hardware implementations makes it challenging to select the best options. This paper presents ongoing research into co-designing vector architectures for CPU-based CNN inference, focusing on the im2col+GEMM and Winograd kernels. Using the Gem5 simulator, we examine the impact of various hardware microarchitectural features on RISC-V Vector and ARM-SVE ISAs. We also study the impact of several BLIS-like algorithmic optimizations on im2col+GEMM. Our co-design study shows that longer vector lengths and larger caches can improve performance by 5x with our optimized CNN kernels, compared to a 512-bit vector length and 1 MB of L2 cache. For Winograd, we present a novel approach of inter-tile parallelization that exploits longer vector lengths and offers high memory reuse, resulting in up to a 2.4x performance improvement for non-strided convolutional layers with a 3x3 kernel size. Our study also shows that Winograd requires smaller cache sizes than im2col+GEMM.
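To make the im2col+GEMM lowering mentioned in the abstract concrete, below is a minimal NumPy sketch: the input feature map is unfolded into a column matrix so that the whole convolution collapses into a single matrix multiplication, which is the GEMM that long vector units and BLIS-like blocking target. The function names, tensor shapes, and the NumPy formulation are illustrative assumptions for exposition, not the authors' implementation.

    import numpy as np

    def im2col(x, kh, kw, stride=1):
        """Unfold a (C, H, W) input into a (C*kh*kw, out_h*out_w) column matrix."""
        c, h, w = x.shape
        out_h = (h - kh) // stride + 1
        out_w = (w - kw) // stride + 1
        cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
        row = 0
        for ci in range(c):
            for i in range(kh):
                for j in range(kw):
                    patch = x[ci,
                              i:i + stride * out_h:stride,
                              j:j + stride * out_w:stride]
                    cols[row] = patch.reshape(-1)
                    row += 1
        return cols, out_h, out_w

    def conv2d_im2col(x, weights, stride=1):
        """Convolution expressed as one GEMM: (K, C*kh*kw) x (C*kh*kw, out_h*out_w)."""
        k, c, kh, kw = weights.shape
        cols, out_h, out_w = im2col(x, kh, kw, stride)
        w_mat = weights.reshape(k, c * kh * kw)
        out = w_mat @ cols   # the GEMM kernel that vector-length scaling accelerates
        return out.reshape(k, out_h, out_w)

    # Illustrative layer: 3x3, non-strided, like the layers the abstract targets with Winograd
    x = np.random.rand(16, 32, 32).astype(np.float32)
    w = np.random.rand(64, 16, 3, 3).astype(np.float32)
    y = conv2d_im2col(x, w)
    print(y.shape)  # (64, 30, 30)

In this formulation the cache and vector-length trade-offs studied in the paper show up directly: the column matrix is the large intermediate that pressures the L2 cache, while the inner GEMM is where longer vector registers are exploited.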
