Paper Title
MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
Paper Authors
Paper Abstract
Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT_BASE) and 62 ms latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering tasks, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).
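The abstract's key architectural idea (a thin BERT whose blocks squeeze the hidden state through a bottleneck before self-attention and a stack of feed-forward networks, then expand it back) can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the authors' released code: the class name BottleneckTransformerBlock and all default dimensions below are hypothetical placeholders chosen only to show the structure the abstract describes.

# Minimal sketch (illustrative only) of a bottleneck transformer block:
# project a wide inter-block hidden state down to a narrow intra-block width,
# run self-attention plus several stacked feed-forward networks there, then
# project back up. Dimensions are placeholders, not the paper's exact config.
import torch
import torch.nn as nn


class BottleneckTransformerBlock(nn.Module):
    def __init__(self, hidden=512, bottleneck=128, heads=4, ffn_inner=512, num_ffn=4):
        super().__init__()
        # Linear "bottleneck" projections between wide and narrow representations.
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        # Self-attention operates at the narrow bottleneck width.
        self.attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(bottleneck)
        # Several stacked FFNs re-balance attention vs. feed-forward capacity.
        self.ffns = nn.ModuleList(
            nn.Sequential(
                nn.Linear(bottleneck, ffn_inner),
                nn.ReLU(),
                nn.Linear(ffn_inner, bottleneck),
            )
            for _ in range(num_ffn)
        )
        self.ffn_norms = nn.ModuleList(nn.LayerNorm(bottleneck) for _ in range(num_ffn))
        self.out_norm = nn.LayerNorm(hidden)

    def forward(self, x):
        # x: (batch, seq_len, hidden) wide inter-block features.
        h = self.down(x)                              # squeeze to bottleneck width
        a, _ = self.attn(h, h, h, need_weights=False)
        h = self.attn_norm(h + a)                     # residual + norm around attention
        for ffn, norm in zip(self.ffns, self.ffn_norms):
            h = norm(h + ffn(h))                      # residual + norm around each FFN
        return self.out_norm(x + self.up(h))          # expand back, add wide residual


if __name__ == "__main__":
    block = BottleneckTransformerBlock()
    tokens = torch.randn(2, 16, 512)                  # (batch, seq_len, hidden)
    print(block(tokens).shape)                        # torch.Size([2, 16, 512])

The knowledge-transfer step mentioned in the abstract would additionally train such a student block to match the layer outputs of the inverted-bottleneck teacher; that training loop is beyond the scope of this sketch.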