Paper Title
Global Mixup: Eliminating Ambiguity with Clustering
Paper Authors
Paper Abstract
Data augmentation with \textbf{Mixup} has proven to be an effective method for regularizing deep neural networks. Mixup generates virtual samples and their corresponding labels in a single step through linear interpolation. However, this one-stage generation paradigm and its reliance on linear interpolation have two defects: (1) The label of a generated sample is combined directly from the labels of the original sample pair without any check on whether it is reasonable, so the label is likely to be ambiguous. (2) Linear interpolation significantly limits the sampling space of the generated samples. To tackle these problems, we propose \textbf{Global Mixup}, a novel and effective augmentation method based on global clustering relationships. Specifically, we transform the previous one-stage augmentation process into a two-stage process, decoupling the generation of virtual samples from their labeling. The generated samples are then relabeled based on clustering, by computing their global relationships. In addition, we are no longer limited to linear relationships and can generate more reliable virtual samples in a larger sampling space. Extensive experiments with \textbf{CNN}, \textbf{LSTM}, and \textbf{BERT} on five tasks show that Global Mixup significantly outperforms previous state-of-the-art baselines. Further experiments also demonstrate the advantage of Global Mixup in low-resource scenarios.
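The contrast drawn in the abstract can be illustrated with a minimal Python sketch. The `mixup` function follows the standard Mixup formulation, in which one interpolation weight is applied to both inputs and labels. The `relabel_globally` function is only an assumption made for illustration: the abstract does not give the relabeling formula, so this hypothetical helper uses a softmax over negative Euclidean distances to all training samples as a stand-in for "global relationships". The function names, the `temperature` parameter, and the toy data are likewise hypothetical.

```python
import numpy as np


def mixup(x_i, y_i, x_j, y_j, lam):
    """Vanilla Mixup: the same weight lam interpolates inputs and labels."""
    x = lam * x_i + (1.0 - lam) * x_j
    y = lam * y_i + (1.0 - lam) * y_j
    return x, y


def relabel_globally(x_new, train_x, train_y, temperature=1.0):
    """Hypothetical second stage (an assumption, not the paper's formula):
    label a virtual sample from its relation to ALL training samples,
    weighting each training label by a softmax over negative distances."""
    dists = np.linalg.norm(train_x - x_new, axis=1)
    logits = -dists / temperature
    weights = np.exp(logits - logits.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ train_y


# Toy data: four 2-D points with one-hot labels for two classes.
train_x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
train_y = np.eye(2)[[0, 0, 1, 1]]

rng = np.random.default_rng(0)
lam = rng.beta(0.5, 0.5)  # Mixup commonly draws lam from Beta(alpha, alpha)

# Stage 1: generate a virtual sample (here still by linear interpolation).
x_mix, y_mix = mixup(train_x[0], train_y[0], train_x[3], train_y[3], lam)

# Stage 2: ignore y_mix and relabel from the whole training set instead.
y_global = relabel_globally(x_mix, train_x, train_y)
```

In this sketch `y_mix` depends only on the two parent labels, whereas `y_global` depends on every training sample, which is the decoupling the abstract describes. The sample generator could also be replaced by a non-linear one without changing the relabeling step.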