Paper Title

Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks

Authors

Xing Wu, Chaochen Gao, Meng Lin, Liangjun Zang, Zhongyuan Wang, Songlin Hu

Abstract


Before entering the neural network, a token is generally converted to the corresponding one-hot representation, which is a discrete distribution of the vocabulary. Smoothed representation is the probability of candidate tokens obtained from a pre-trained masked language model, which can be seen as a more informative substitution to the one-hot representation. We propose an efficient data augmentation method, termed text smoothing, by converting a sentence from its one-hot representation to a controllable smoothed representation. We evaluate text smoothing on different benchmarks in a low-resource regime. Experimental results show that text smoothing outperforms various mainstream data augmentation methods by a substantial margin. Moreover, text smoothing can be combined with those data augmentation methods to achieve better performance.
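The core idea above can be sketched numerically: replace each token's one-hot vocabulary distribution with a convex mixture of the one-hot vector and the MLM's predicted token distribution, then feed the mixture through the embedding matrix. The sketch below is a minimal toy illustration, not the authors' implementation: the vocabulary size, embedding dimension, mixing weight `lam`, and the randomly generated "MLM logits" are all placeholder assumptions (the paper obtains the smoothed distribution from a pre-trained masked language model such as BERT).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; a real MLM has a ~30k vocabulary).
vocab_size, embed_dim, seq_len = 100, 16, 5
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))
token_ids = rng.integers(0, vocab_size, size=seq_len)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# One-hot representation: a discrete distribution over the vocabulary.
one_hot = np.eye(vocab_size)[token_ids]          # (seq_len, vocab_size)

# Smoothed representation: in the paper this is the pre-trained MLM's
# predicted distribution over candidate tokens; here we fake the logits.
mlm_logits = rng.normal(size=(seq_len, vocab_size))
smoothed = softmax(mlm_logits)                   # (seq_len, vocab_size)

# Controllable mixture: lam interpolates between the one-hot and
# smoothed distributions (lam is a placeholder mixing weight).
lam = 0.5
mixed = lam * one_hot + (1 - lam) * smoothed     # rows still sum to 1

# The augmented input embedding is the distribution-weighted sum of
# embedding rows, instead of a single looked-up row per token.
augmented_embeddings = mixed @ embedding_matrix  # (seq_len, embed_dim)
print(augmented_embeddings.shape)
```

Because `mixed` stays a valid probability distribution per position, the result is a soft, "smoothed" input that downstream classifier layers can consume in place of ordinary embedding lookups.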
