Paper Title
Uncertainty-aware Self-training for Text Classification with Few Labels
Paper Authors
Paper Abstract
The recent success of large-scale pre-trained language models crucially hinges on fine-tuning them on large amounts of labeled data for the downstream task, which is typically expensive to acquire. In this work, we study self-training, one of the earliest semi-supervised learning approaches, to reduce the annotation bottleneck by making use of large-scale unlabeled data for the target task. The standard self-training mechanism randomly samples instances from the unlabeled pool to pseudo-label and augment the labeled data. We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network, leveraging recent advances in Bayesian deep learning. Specifically, we propose (i) acquisition functions to select instances from the unlabeled pool using Monte Carlo (MC) Dropout, and (ii) a learning mechanism that leverages model confidence for self-training. As an application, we focus on text classification on five benchmark datasets. We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, perform within 3% of fully supervised pre-trained language models fine-tuned on thousands of labeled instances, with an aggregate accuracy of 91%, improving by up to 12% over baselines.
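To make the MC Dropout-based uncertainty estimation mentioned in the abstract concrete, the following is a minimal Python/PyTorch sketch, not the authors' exact method: it assumes a hypothetical classifier `model` that returns logits, keeps dropout active at inference time to draw several stochastic forward passes, scores unlabeled instances by predictive entropy (one possible acquisition function), and selects instances for pseudo-labeling. Names such as `mc_dropout_predict`, `n_samples`, and the `easy` switch are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mc_dropout_predict(model, inputs, n_samples=20):
    """Run several stochastic forward passes with dropout kept active (MC Dropout)."""
    model.train()  # keep dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(inputs), dim=-1) for _ in range(n_samples)]
        )  # shape: (n_samples, batch, num_classes)
    mean_probs = probs.mean(dim=0)
    # Predictive entropy of the averaged distribution as a simple uncertainty score.
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
    return mean_probs, entropy

def select_for_pseudo_labeling(model, unlabeled_inputs, k=100, easy=True):
    """Pick k instances from the unlabeled pool by uncertainty.

    easy=True selects the lowest-entropy (most confident) instances for pseudo-labeling;
    easy=False selects the highest-entropy ones instead.
    """
    mean_probs, entropy = mc_dropout_predict(model, unlabeled_inputs)
    order = entropy.argsort(descending=not easy)
    chosen = order[:k]
    pseudo_labels = mean_probs[chosen].argmax(dim=-1)
    return chosen, pseudo_labels
```

In an actual self-training loop, the selected instances and their pseudo-labels would be added to the labeled set (optionally weighted by model confidence, as the paper's learning mechanism suggests) before the model is re-trained; the details of the acquisition functions and confidence weighting are in the paper itself.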