Paper Title
SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness
Paper Authors
Paper Abstract
Models that perform well on a training domain often fail to generalize to out-of-domain (OOD) examples. Data augmentation is a common method used to prevent overfitting and improve OOD generalization. However, in natural language, it is difficult to generate new examples that stay on the underlying data manifold. We introduce SSMBA, a data augmentation method for generating synthetic training examples by using a pair of corruption and reconstruction functions to move randomly on a data manifold. We investigate the use of SSMBA in the natural language domain, leveraging the manifold assumption to reconstruct corrupted text with masked language models. In experiments on robustness benchmarks across 3 tasks and 9 datasets, SSMBA consistently outperforms existing data augmentation methods and baseline models on both in-domain and OOD data, achieving gains of 0.8% accuracy on OOD Amazon reviews, 1.8% accuracy on OOD MNLI, and 1.4 BLEU on in-domain IWSLT14 German-English.
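The abstract describes SSMBA's core loop: corrupt an example, then reconstruct it with a masked language model to obtain a new example near the data manifold. Below is a minimal, hypothetical sketch of that idea using the HuggingFace transformers library; the model choice (bert-base-uncased), the mask-only corruption, and the function name ssmba_augment are illustrative assumptions, not the authors' released implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed MLM backbone for the sketch; any masked LM would do.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def ssmba_augment(sentence, mask_prob=0.15, num_samples=4):
    """Generate augmented sentences by masking random tokens
    (corruption) and resampling them from the MLM (reconstruction)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    input_ids = inputs["input_ids"]
    # Don't mask special tokens like [CLS] and [SEP].
    special = tokenizer.get_special_tokens_mask(
        input_ids[0].tolist(), already_has_special_tokens=True
    )
    maskable = torch.tensor(special) == 0
    augmented = []
    for _ in range(num_samples):
        # Corruption: mask a random subset of maskable tokens.
        mask = (torch.rand(input_ids.shape[1]) < mask_prob) & maskable
        if not mask.any():
            augmented.append(sentence)  # nothing masked; keep original
            continue
        corrupted = input_ids.clone()
        corrupted[0, mask] = tokenizer.mask_token_id
        # Reconstruction: sample replacements from the MLM distribution.
        with torch.no_grad():
            logits = model(input_ids=corrupted).logits
        probs = torch.softmax(logits[0, mask], dim=-1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        reconstructed = input_ids.clone()
        reconstructed[0, mask] = sampled
        augmented.append(
            tokenizer.decode(reconstructed[0], skip_special_tokens=True)
        )
    return augmented

print(ssmba_augment("The movie was surprisingly good."))
```

In the paper's setting, the augmented sentences would inherit the original example's label and be added to the training set; sampling (rather than greedy argmax decoding) keeps the generated neighbors diverse.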