Paper Title

Which Student is Best? A Comprehensive Knowledge Distillation Exam for Task-Specific BERT Models

Paper Authors

Made Nindyatama Nityasya, Haryo Akbarianto Wibowo, Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji

Paper Abstract

We perform a knowledge distillation (KD) benchmark from task-specific BERT-base teacher models to various student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small. Our experiments involve 12 datasets grouped into two tasks: text classification and sequence labeling in the Indonesian language. We also compare various aspects of distillation, including the use of word embeddings and unlabeled data augmentation. Our experiments show that, despite the rising popularity of Transformer-based models, BiLSTM and CNN student models provide the best trade-off between performance and computational resources (CPU, RAM, and storage) compared to pruned BERT models. We further propose some quick wins for performing KD to produce small NLP models via efficient KD training mechanisms involving simple choices of loss functions, word embeddings, and unlabeled data preparation.
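
To make the "simple choices of loss functions" concrete, below is a minimal sketch of a widely used KD objective: a temperature-softened KL divergence between teacher and student logits mixed with the standard hard-label cross-entropy. This is an illustrative assumption of one common formulation, written in PyTorch; the function name and hyperparameter values (`temperature`, `alpha`) are placeholders, not the settings reported in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Mix soft-target KD loss with hard-label cross-entropy.

    Hypothetical example; temperature and alpha are placeholder values.
    """
    # KL divergence between temperature-softened distributions;
    # scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Standard supervised loss on the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In a typical setup, the teacher's logits are computed with `torch.no_grad()` on labeled or unlabeled (augmented) inputs, and only the student's parameters are updated with this combined loss.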
