Paper Title
MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers
Paper Authors
Paper Abstract
We generalize deep self-attention distillation in MiniLM (Wang et al., 2020) by only using self-attention relation distillation for task-agnostic compression of pretrained Transformers. In particular, we define multi-head self-attention relations as scaled dot-product between the pairs of query, key, and value vectors within each self-attention module. Then we employ the above relational knowledge to train the student model. Besides its simplicity and unified principle, more favorably, there is no restriction in terms of the number of student's attention heads, while most previous work has to guarantee the same head number between teacher and student. Moreover, the fine-grained self-attention relations tend to fully exploit the interaction knowledge learned by Transformer. In addition, we thoroughly examine the layer selection strategy for teacher models, rather than just relying on the last layer as in MiniLM. We conduct extensive experiments on compressing both monolingual and multilingual pretrained models. Experimental results demonstrate that our models distilled from base-size and large-size teachers (BERT, RoBERTa and XLM-R) outperform the state-of-the-art.
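As a rough illustration of the objective described in the abstract, the PyTorch-style sketch below computes query-query, key-key, and value-value relation matrices as softmax-normalized scaled dot-products and matches the teacher's and student's relations with a KL-divergence loss. This is a minimal sketch under stated assumptions: the function names, the relation-head count, and the uniform loss weighting are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def self_attention_relation(vectors, num_relation_heads):
    """Compute a self-attention relation matrix from per-token vectors.

    `vectors`: [batch, seq_len, hidden] query, key, or value outputs of one
    self-attention module with the attention heads already merged. The vectors
    are re-split into `num_relation_heads` relation heads, so teacher and
    student do not need the same number of attention heads.
    """
    bsz, seq_len, hidden = vectors.size()
    head_dim = hidden // num_relation_heads
    # [batch, relation_heads, seq_len, head_dim]
    v = vectors.view(bsz, seq_len, num_relation_heads, head_dim).transpose(1, 2)
    # Scaled dot-product between all pairs of positions, then a softmax over
    # positions to obtain a relation distribution (returned as log-probs).
    scores = torch.matmul(v, v.transpose(-1, -2)) / (head_dim ** 0.5)
    return F.log_softmax(scores, dim=-1)


def relation_distillation_loss(teacher_q, teacher_k, teacher_v,
                               student_q, student_k, student_v,
                               num_relation_heads=48):
    """KL divergence between teacher and student Q-Q, K-K, and V-V relations."""
    loss = 0.0
    for t_vec, s_vec in ((teacher_q, student_q),
                         (teacher_k, student_k),
                         (teacher_v, student_v)):
        t_rel = self_attention_relation(t_vec, num_relation_heads)
        s_rel = self_attention_relation(s_vec, num_relation_heads)
        # KL(teacher || student); both inputs are log-probabilities.
        loss = loss + F.kl_div(s_rel, t_rel, log_target=True,
                               reduction="batchmean")
    return loss / 3.0
```

Because both models' vectors are re-split into a shared number of relation heads before the dot-products, the student's own head count never has to match the teacher's, which is the flexibility the abstract highlights over prior attention-distillation work.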