Paper Title
Adversarial Self-Supervised Data-Free Distillation for Text Classification
Paper Authors
Paper Abstract
Large pre-trained transformer-based language models have achieved impressive results on a wide range of NLP tasks. In the past few years, Knowledge Distillation (KD) has become a popular paradigm to compress a computationally expensive model into a resource-efficient lightweight model. However, most KD algorithms, especially in NLP, rely on the accessibility of the original training dataset, which may be unavailable due to privacy issues. To tackle this problem, we propose a novel two-stage data-free distillation method, named Adversarial Self-Supervised Data-Free Distillation (AS-DFD), which is designed for compressing large-scale transformer-based models (e.g., BERT). To avoid text generation in discrete space, we introduce a Plug & Play Embedding Guessing method to craft pseudo embeddings from the teacher's hidden knowledge. Meanwhile, with a self-supervised module to quantify the student's ability, we adapt the difficulty of the pseudo embeddings in an adversarial training manner. To the best of our knowledge, our framework is the first data-free distillation framework designed for NLP tasks. We verify the effectiveness of our method on several text classification datasets.
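As a rough, self-contained PyTorch sketch of the general two-stage idea described in the abstract (not the paper's actual implementation: the toy MLP teacher/student, dimensions, optimizers, iteration counts, and temperature are all placeholder assumptions, and the self-supervised, adversarial difficulty adaptation is omitted), pseudo input embeddings are first optimized in continuous space so that a frozen teacher labels them confidently, and the student is then distilled on those pseudo embeddings without any original training data.

import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, NUM_CLASSES, BATCH = 128, 4, 32

# Toy stand-ins for a large teacher (e.g., BERT) and a lightweight student;
# both map a sentence-level embedding to class logits.
teacher = nn.Sequential(nn.Linear(EMB_DIM, 256), nn.ReLU(), nn.Linear(256, NUM_CLASSES))
student = nn.Sequential(nn.Linear(EMB_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_CLASSES))
teacher.eval()
for p in teacher.parameters():          # the teacher is frozen; no real data is used
    p.requires_grad_(False)
student_opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    # Stage 1: "guess" pseudo embeddings in continuous space so that the
    # frozen teacher assigns them confidently to randomly chosen target classes.
    pseudo_emb = torch.randn(BATCH, EMB_DIM, requires_grad=True)
    targets = torch.randint(0, NUM_CLASSES, (BATCH,))
    emb_opt = torch.optim.Adam([pseudo_emb], lr=0.1)
    for _ in range(20):
        emb_opt.zero_grad()
        F.cross_entropy(teacher(pseudo_emb), targets).backward()
        emb_opt.step()

    # Stage 2: distill the teacher's soft predictions on the pseudo
    # embeddings into the student (temperature-scaled KL divergence).
    T = 2.0
    with torch.no_grad():
        soft_targets = F.softmax(teacher(pseudo_emb) / T, dim=-1)
    student_opt.zero_grad()
    loss_kd = F.kl_div(F.log_softmax(student(pseudo_emb.detach()) / T, dim=-1),
                       soft_targets, reduction="batchmean")
    loss_kd.backward()
    student_opt.step()

In the full AS-DFD method, the difficulty of the crafted embeddings is additionally adapted in an adversarial manner based on a self-supervised estimate of the student's ability, which this sketch does not model.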