Paper Title
MiniViT: Compressing Vision Transformers with Weight Multiplexing
Paper Authors
Paper Abstract
Vision Transformer (ViT) models have recently drawn much attention in computer vision due to their high model capability. However, ViT models suffer from a huge number of parameters, restricting their applicability on devices with limited memory. To alleviate this problem, we propose MiniViT, a new compression framework that achieves parameter reduction in vision transformers while retaining the same performance. The central idea of MiniViT is to multiplex the weights of consecutive transformer blocks. More specifically, we share the weights across layers while imposing a transformation on the weights to increase diversity. Weight distillation over self-attention is also applied to transfer knowledge from large-scale ViT models to weight-multiplexed compact models. Comprehensive experiments demonstrate the efficacy of MiniViT, showing that it can reduce the size of the pre-trained Swin-B transformer by 48%, while achieving an increase of 1.0% in Top-1 accuracy on ImageNet. Moreover, using a single layer of parameters, MiniViT is able to compress DeiT-B by 9.7 times, from 86M to 9M parameters, without seriously compromising performance. Finally, we verify the transferability of MiniViT by reporting its performance on downstream benchmarks. Code and models are available here.
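The parameter savings from weight multiplexing can be illustrated with a minimal sketch. The abstract only states that weights are shared across layers and then transformed per layer to restore diversity; the diagonal scale-and-bias transform below is a hypothetical simplification chosen for brevity, not the paper's actual transformation, and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM, LAYERS = 64, 12  # illustrative sizes, not the paper's

# Baseline: each transformer block owns its own projection matrix.
per_layer_params = LAYERS * DIM * DIM

# Weight multiplexing (simplified sketch): one matrix shared by all
# blocks, plus a cheap per-layer transformation. Here the transform is
# a hypothetical per-layer diagonal scale and bias that modulates the
# shared matrix so effective weights differ across layers.
shared_W = rng.standard_normal((DIM, DIM))
layer_scale = rng.standard_normal((LAYERS, DIM))
layer_bias = rng.standard_normal((LAYERS, DIM))

def layer_weight(layer: int) -> np.ndarray:
    """Effective weight of one block: shared matrix, modulated per layer."""
    return shared_W * layer_scale[layer][:, None] + layer_bias[layer][:, None]

multiplexed_params = DIM * DIM + LAYERS * 2 * DIM  # shared matrix + transforms
print(f"baseline params:    {per_layer_params}")      # 49152
print(f"multiplexed params: {multiplexed_params}")    # 5632
print(f"reduction:          {per_layer_params / multiplexed_params:.1f}x")
```

The shared matrix dominates the parameter count, so adding more layers costs only the small per-layer transforms; this is the same effect that lets MiniViT shrink DeiT-B roughly tenfold while each block still computes with a distinct effective weight.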