Paper Title
MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
Paper Authors
Paper Abstract
Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT_BASE) and 62 ms latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering tasks, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).
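The abstract's key architectural idea (a thin BERT whose blocks squeeze the hidden state through a bottleneck before self-attention and a stack of feed-forward networks, then expand it back) can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the authors' released code: the class name BottleneckTransformerBlock and all default dimensions below are hypothetical placeholders chosen only to show the structure the abstract describes.

# Minimal sketch (illustrative only) of a bottleneck transformer block:
# project a wide inter-block hidden state down to a narrow intra-block width,
# run self-attention plus several stacked feed-forward networks there, then
# project back up. Dimensions are placeholders, not the paper's exact config.
import torch
import torch.nn as nn


class BottleneckTransformerBlock(nn.Module):
    def __init__(self, hidden=512, bottleneck=128, heads=4, ffn_inner=512, num_ffn=4):
        super().__init__()
        # Linear "bottleneck" projections between wide and narrow representations.
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        # Self-attention operates at the narrow bottleneck width.
        self.attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(bottleneck)
        # Several stacked FFNs re-balance attention vs. feed-forward capacity.
        self.ffns = nn.ModuleList(
            nn.Sequential(
                nn.Linear(bottleneck, ffn_inner),
                nn.ReLU(),
                nn.Linear(ffn_inner, bottleneck),
            )
            for _ in range(num_ffn)
        )
        self.ffn_norms = nn.ModuleList(nn.LayerNorm(bottleneck) for _ in range(num_ffn))
        self.out_norm = nn.LayerNorm(hidden)

    def forward(self, x):
        # x: (batch, seq_len, hidden) wide inter-block features.
        h = self.down(x)                              # squeeze to bottleneck width
        a, _ = self.attn(h, h, h, need_weights=False)
        h = self.attn_norm(h + a)                     # residual + norm around attention
        for ffn, norm in zip(self.ffns, self.ffn_norms):
            h = norm(h + ffn(h))                      # residual + norm around each FFN
        return self.out_norm(x + self.up(h))          # expand back, add wide residual


if __name__ == "__main__":
    block = BottleneckTransformerBlock()
    tokens = torch.randn(2, 16, 512)                  # (batch, seq_len, hidden)
    print(block(tokens).shape)                        # torch.Size([2, 16, 512])

The knowledge-transfer step mentioned in the abstract would additionally train such a student block to match the layer outputs of the inverted-bottleneck teacher; that training loop is beyond the scope of this sketch.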