Paper Title

S3NAS: Fast NPU-aware Neural Architecture Search Methodology

Paper Authors

Jaeseong Lee, Duseok Kang, Soonhoi Ha

Paper Abstract

As the application area of convolutional neural networks (CNNs) grows in embedded devices, it has become popular to use a hardware CNN accelerator, called a neural processing unit (NPU), to achieve higher performance per watt than CPUs or GPUs. Recently, automated neural architecture search (NAS) has emerged as the default technique for finding state-of-the-art CNN architectures with higher accuracy than manually designed architectures for image classification. In this paper, we present a fast NPU-aware NAS methodology, called S3NAS, to find a CNN architecture with higher accuracy than existing ones under a given latency constraint. It consists of three steps: supernet design, Single-Path NAS for fast architecture exploration, and scaling. To widen the search space of the supernet structure, which consists of stages, we allow stages to have different numbers of blocks and blocks to have parallel layers of different kernel sizes. For fast neural architecture search, we apply a modified Single-Path NAS technique to the proposed supernet structure. In this step, we assume a shorter latency constraint than required in order to reduce the search space and the search time. The last step is to scale up the network maximally within the latency constraint. For accurate latency estimation, an analytical latency estimator is devised, based on a cycle-level NPU simulator that runs an entire CNN and accurately accounts for memory access overhead. With the proposed methodology, we are able to find a network in 3 hours using TPUv3 that shows 82.72% top-1 accuracy on ImageNet with 11.66 ms latency. Code is released at https://github.com/cap-lab/S3NAS
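To make the final step of the flow concrete, below is a minimal, hypothetical Python sketch of latency-constrained scaling: a greedy loop that enlarges the searched network as long as an analytical latency estimator keeps it under the latency budget. The names (ScalingConfig, estimate_latency, scale_up) and the toy latency model are illustrative assumptions for exposition only; the paper's actual estimator is derived from a cycle-level NPU simulator, and its scaling rule may differ.

```python
# Hypothetical sketch of the scaling step described in the abstract.
# All names and the toy latency model below are illustrative assumptions,
# not the authors' implementation (see https://github.com/cap-lab/S3NAS).

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ScalingConfig:
    depth_mult: float = 1.0   # multiplier on the number of blocks per stage
    width_mult: float = 1.0   # multiplier on channels per layer


def estimate_latency(cfg: ScalingConfig) -> float:
    """Stand-in for an analytical NPU latency estimator (milliseconds)."""
    # Toy model: latency grows linearly with depth and quadratically with width.
    return 6.0 * cfg.depth_mult * cfg.width_mult ** 2


def scale_up(latency_budget_ms: float, step: float = 0.05) -> ScalingConfig:
    """Greedily enlarge depth/width while staying under the latency budget."""
    cfg = ScalingConfig()
    improved = True
    while improved:
        improved = False
        for field in ("depth_mult", "width_mult"):
            candidate = replace(cfg, **{field: getattr(cfg, field) + step})
            if estimate_latency(candidate) <= latency_budget_ms:
                cfg, improved = candidate, True
    return cfg


if __name__ == "__main__":
    cfg = scale_up(latency_budget_ms=11.66)
    print(cfg, f"estimated latency: {estimate_latency(cfg):.2f} ms")
```

The sketch only shows the shape of the idea: search under a deliberately tight latency constraint first, then spend the remaining budget by scaling the network up against the latency estimate.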
