Paper Title

Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining

Authors

Zijun Sun, Chun Fan, Xiaofei Sun, Yuxian Meng, Fei Wu, Jiwei Li

Abstract

The goal of semi-supervised learning is to utilize the unlabeled, in-domain dataset U to improve models trained on the labeled dataset D. Under the context of large-scale language-model (LM) pretraining, how we can make the best use of U is poorly understood: Is semi-supervised learning still beneficial in the presence of large-scale pretraining? Should U be used for in-domain LM pretraining or for pseudo-label generation? How should the pseudo-label-based semi-supervised model actually be implemented? How do different semi-supervised strategies affect performance with respect to D of different sizes, U of different sizes, etc.? In this paper, we conduct comprehensive studies on semi-supervised learning for the task of text classification under the context of large-scale LM pretraining. Our studies shed important light on the behavior of semi-supervised learning methods: (1) with in-domain LM pretraining on U, open-domain LM pretraining is unnecessary; (2) both the in-domain pretraining strategy and the pseudo-label-based strategy introduce significant performance boosts, with the former performing better with larger U, the latter performing better with smaller U, and their combination leading to the largest boost; (3) self-training (pretraining first on the pseudo-labeled set D' and then fine-tuning on D) yields better performance when D is small, while joint training on the combination of the pseudo labels D' and the original dataset D yields better performance when D is large. Using semi-supervised learning strategies, we are able to achieve around 93.8% accuracy with only 50 training data points on the IMDB dataset, and a competitive 96.6% with the full IMDB dataset. Our work marks an initial step toward understanding the behavior of semi-supervised learning models under the context of large-scale pretraining.
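
To make the two pseudo-label strategies described in the abstract concrete, the following is a minimal, hypothetical sketch: a teacher trained on D assigns pseudo labels to U to form D'; a "self-training" student is first fit on D' and then continues training on D; a "joint-training" student is fit once on the union of D and D'. Scikit-learn linear classifiers stand in for the pretrained language models here, and all data, variable names, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the two pseudo-label strategies (self-training vs. joint training).
# Linear bag-of-words classifiers stand in for pretrained LMs; everything here
# is an illustrative assumption, not the paper's actual implementation.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Toy labeled set D and unlabeled in-domain set U.
D_texts = ["great movie", "terrible plot", "loved it", "waste of time"]
D_labels = np.array([1, 0, 1, 0])
U_texts = ["an absolute delight", "boring and slow", "would watch again", "not worth it"]

vec = TfidfVectorizer().fit(D_texts + U_texts)
X_D, X_U = vec.transform(D_texts), vec.transform(U_texts)

# Step 1: a teacher trained on D generates pseudo labels D' for U.
teacher = LogisticRegression().fit(X_D, D_labels)
pseudo_labels = teacher.predict(X_U)

# Strategy A: self-training -- train on the pseudo-labeled D' first,
# then continue training ("fine-tune") on the gold data D.
student_self = SGDClassifier(loss="log_loss", random_state=0)  # "log_loss" in scikit-learn >= 1.1
student_self.partial_fit(X_U, pseudo_labels, classes=np.array([0, 1]))
for _ in range(5):  # a few additional passes over the gold set D
    student_self.partial_fit(X_D, D_labels)

# Strategy B: joint training -- train once on the union of D and D'.
X_joint = vstack([X_D, X_U])
y_joint = np.concatenate([D_labels, pseudo_labels])
student_joint = LogisticRegression().fit(X_joint, y_joint)

print("self-training preds :", student_self.predict(X_U))
print("joint-training preds:", student_joint.predict(X_U))
```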
