Paper Title
Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks
Paper Authors
Paper Abstract
Before entering the neural network, a token is generally converted to its corresponding one-hot representation, which is a discrete distribution over the vocabulary. A smoothed representation is the probability distribution over candidate tokens obtained from a pre-trained masked language model, which can be seen as a more informative substitute for the one-hot representation. We propose an efficient data augmentation method, termed text smoothing, that converts a sentence from its one-hot representation to a controllable smoothed representation. We evaluate text smoothing on different benchmarks in a low-resource regime. Experimental results show that text smoothing outperforms various mainstream data augmentation methods by a substantial margin. Moreover, text smoothing can be combined with those data augmentation methods to achieve even better performance.
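The "controllable smoothed representation" can be read as an interpolation between the discrete one-hot vector and the masked language model's predicted distribution. The following is a minimal sketch of that idea; the mixing coefficient name `lam`, the toy vocabulary, and the simulated MLM output are illustrative assumptions, not the paper's exact parameterization.

```python
import math

def softmax(logits):
    # Standard numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def text_smoothing(one_hot, mlm_probs, lam=0.1):
    # Convex combination of the discrete one-hot vector and the
    # MLM's predicted token distribution. `lam` is a hypothetical
    # knob controlling how much of the original one-hot is kept;
    # lam=1.0 recovers the unsmoothed input.
    return [lam * o + (1.0 - lam) * p for o, p in zip(one_hot, mlm_probs)]

# Toy vocabulary of 5 tokens; the MLM's output is simulated here.
one_hot = [0.0, 0.0, 1.0, 0.0, 0.0]        # observed token = index 2
mlm_probs = softmax([0.1, 0.4, 2.0, 0.3, 0.2])

smoothed = text_smoothing(one_hot, mlm_probs, lam=0.3)
```

Because both inputs are valid probability distributions, the result still sums to one, so it can be fed to the downstream classifier in place of the one-hot vector.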