Paper Title

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

Paper Authors

Hanrui Wang, Zhekai Zhang, Song Han

Paper Abstract

The attention mechanism is becoming increasingly popular in Natural Language Processing (NLP) applications, showing superior performance over convolutional and recurrent architectures. However, attention becomes the computation bottleneck because of its quadratic computational complexity with respect to input length, complicated data movement, and low arithmetic intensity. Moreover, existing NN accelerators mainly focus on optimizing convolutional or recurrent models, and cannot efficiently support attention. In this paper, we present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access. Inspired by the high redundancy of human languages, we propose the novel cascade token pruning to prune away unimportant tokens in the sentence. We also propose cascade head pruning to remove unessential heads. Cascade pruning is fundamentally different from weight pruning since there is no trainable weight in the attention mechanism, and the pruned tokens and heads are selected on the fly. To efficiently support them on hardware, we design a novel top-k engine to rank token and head importance scores with high throughput. Furthermore, we propose progressive quantization that first fetches MSBs only and performs the computation; if the confidence is low, it fetches LSBs and recomputes the attention outputs, trading computation for memory reduction. Extensive experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0x with no accuracy loss, and achieves 1.6x, 3.0x, 162x, 347x speedup, and 1.4x, 3.2x, 1193x, 4059x energy savings over A3 accelerator, MNNFast accelerator, TITAN Xp GPU, and Xeon CPU, respectively.
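
To make the cascade token pruning idea from the abstract concrete, here is a minimal NumPy sketch under the following assumptions: per-token importance is accumulated from attention probabilities across heads and layers, and only the top-k most important tokens are kept for subsequent layers. This is not the SpAtten hardware or the authors' reference code; function and parameter names such as cascade_token_prune and keep_ratio are illustrative.

import numpy as np

def attention_probs(Q, K):
    """Softmax(Q K^T / sqrt(d)) for a single head; Q, K have shape (n, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)        # (n, n)

def cascade_token_prune(tokens, Q, K, cum_importance, keep_ratio=0.5):
    """Accumulate per-token importance from attention probs and keep the top-k tokens.

    tokens:          (n, d_model) hidden states entering this layer
    Q, K:            (n, d) query/key matrices for one head of this layer
    cum_importance:  (n,) importance accumulated from earlier layers/heads
    keep_ratio:      fraction of tokens to keep (hypothetical knob)
    """
    probs = attention_probs(Q, K)                   # (n, n)
    # A token is important if many queries attend to it: sum over the query axis.
    cum_importance = cum_importance + probs.sum(axis=0)
    k = max(1, int(keep_ratio * len(tokens)))
    keep = np.argsort(cum_importance)[-k:]          # rank importance, keep top-k
    keep.sort()                                      # preserve original token order
    return tokens[keep], cum_importance[keep], keep

In the setting the abstract describes, this top-k selection would be carried out by the dedicated top-k engine rather than a software sort, so ranking keeps up with the attention pipeline's throughput; the pruned tokens are dropped for all following layers, which is what makes the pruning "cascade".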
