Paper Title
EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference
Paper Authors
Paper Abstract
Transformer-based language models such as BERT provide significant accuracy improvements for a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy to resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP. EdgeBERT employs entropy-based early exit prediction in order to perform dynamic voltage-frequency scaling (DVFS), at a sentence granularity, for minimal energy consumption while adhering to a prescribed target latency. Computation and memory footprint overheads are further alleviated by employing a calibrated combination of adaptive attention span, selective network pruning, and floating-point quantization. Furthermore, in order to maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 12nm scalable hardware accelerator system, integrating a fast-switching low-dropout voltage regulator (LDO), an all-digital phase-locked loop (ADPLL), as well as high-density embedded non-volatile memories (eNVMs) wherein the sparse floating-point bit encodings of the shared multi-task parameters are carefully stored. Altogether, latency-aware multi-task NLP inference acceleration on the EdgeBERT hardware system delivers up to 7x, 2.5x, and 53x lower energy compared to conventional inference without early stopping, the latency-unbounded early exit approach, and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
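As a concrete illustration of the mechanism the abstract describes, the sketch below shows how entropy-based early exit can drive per-sentence DVFS: each transformer layer's exit classifier is checked for confidence, and the predicted number of executed layers is used to pick the lowest voltage-frequency operating point that still meets the target latency. This is a minimal sketch under stated assumptions, not the paper's implementation; the entropy threshold, the cycles-per-layer latency model, and the voltage-frequency table are all hypothetical placeholders.

import math
from typing import List, Tuple

def entropy(probs: List[float]) -> float:
    # Shannon entropy of an exit classifier's softmax output;
    # low entropy signals a confident prediction and a safe early exit.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predicted_exit_layer(per_layer_probs: List[List[float]],
                         threshold: float) -> int:
    # Number of layers a sentence is expected to execute: the first
    # layer whose exit entropy drops below the threshold, else all layers.
    for i, probs in enumerate(per_layer_probs):
        if entropy(probs) < threshold:
            return i + 1
    return len(per_layer_probs)

def pick_vf_point(n_layers: int,
                  target_latency_s: float,
                  vf_table: List[Tuple[float, float]],
                  cycles_per_layer: float) -> Tuple[float, float]:
    # Choose the slowest (hence lowest-voltage, lowest-energy) operating
    # point that still finishes the predicted workload within the deadline.
    # vf_table holds hypothetical (voltage_V, freq_Hz) pairs sorted by
    # ascending frequency; the linear cycles-per-layer cost model is an
    # assumption made for this sketch.
    for volt, freq in vf_table:
        if n_layers * cycles_per_layer / freq <= target_latency_s:
            return volt, freq
    return vf_table[-1]  # fall back to the fastest point if infeasible

For example, a sentence predicted to exit after 4 of 12 layers can run at a much lower voltage-frequency pair and still meet a 50 ms deadline, whereas a sentence predicted to need all 12 layers is scheduled at a faster point; this per-sentence selection is what distinguishes the approach from the latency-unbounded early exit baseline cited in the abstract's energy comparison.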