Paper Title
Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging
Paper Authors
Paper Abstract
Knowledge Distillation (KD) is a commonly used technique for improving the generalization of compact Pre-trained Language Models (PLMs) on downstream tasks. However, such methods impose the additional burden of training a separate teacher model for every new dataset. Alternatively, one may directly work on the improvement of the optimization procedure of the compact model toward better generalization. Recent works observe that the flatness of the local minimum correlates well with better generalization. In this work, we adapt Stochastic Weight Averaging (SWA), a method encouraging convergence to a flatter minimum, to fine-tuning PLMs. We conduct extensive experiments on various NLP tasks (text classification, question answering, and generation) and different model architectures and demonstrate that our adaptation improves the generalization without extra computation cost. Moreover, we observe that this simple optimization technique is able to outperform the state-of-the-art KD methods for compact models.
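To make the idea concrete, below is a minimal sketch of how Stochastic Weight Averaging can be plugged into a standard PLM fine-tuning loop using PyTorch's built-in torch.optim.swa_utils. This is not the paper's exact recipe: the model name (distilbert-base-uncased), the learning rates, and the choice to start averaging after 75% of the training steps are illustrative assumptions.

import torch
from torch.optim.swa_utils import AveragedModel, SWALR
from transformers import AutoModelForSequenceClassification

def finetune_with_swa(train_dataloader, num_training_steps, swa_start_frac=0.75):
    # Hypothetical setup: any compact PLM and optimizer would do here.
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    swa_model = AveragedModel(model)               # keeps a running average of the weights
    swa_scheduler = SWALR(optimizer, swa_lr=1e-5)  # constant LR once averaging begins
    swa_start = int(swa_start_frac * num_training_steps)  # assumed: average over the last 25% of steps

    model.train()
    for step, batch in enumerate(train_dataloader):
        # batch is expected to contain input tensors plus "labels" so the model returns a loss
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step >= swa_start:
            swa_model.update_parameters(model)     # fold the current weights into the average
            swa_scheduler.step()

    # Evaluate and deploy swa_model, whose averaged weights tend to lie in a flatter minimum.
    # (No update_bn pass is needed here, since Transformer PLMs use LayerNorm, not BatchNorm.)
    return swa_model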