Paper Title
Unshuffling Data for Improved Generalization
Paper Authors
Paper Abstract
Generalization beyond the training distribution is a core challenge in machine learning. The common practice of mixing and shuffling examples when training neural networks may not be optimal in this regard. We show that partitioning the data into well-chosen, non-i.i.d. subsets treated as multiple training environments can guide the learning of models with better out-of-distribution generalization. We describe a training procedure to capture the patterns that are stable across environments while discarding spurious ones. The method goes a step beyond correlation-based learning: the choice of the partitioning allows injecting information about the task that cannot otherwise be recovered from the joint distribution of the training data. We demonstrate multiple use cases with the task of visual question answering, which is notorious for dataset biases. We obtain significant improvements on VQA-CP, using environments built from prior knowledge, existing metadata, or unsupervised clustering. We also get improvements on GQA using annotations of "equivalent questions", and on multi-dataset training (VQA v2 / Visual Genome) by treating the datasets as distinct environments.
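As a rough illustration of one of the strategies the abstract mentions, environments can be built from the training data itself via unsupervised clustering: examples are clustered on some feature representation, and each cluster becomes a non-i.i.d. training environment. The sketch below is hypothetical (the feature choice, k-means, and the helper name `make_environments` are assumptions for illustration, not the paper's exact procedure):

```python
# Hypothetical sketch: partition a dataset into non-i.i.d. "training
# environments" by clustering per-example features, one of the ways the
# abstract says environments can be built. The specific features and the
# use of k-means here are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def make_environments(features: np.ndarray, n_envs: int = 3, seed: int = 0):
    """Return a list of index arrays, one per environment (cluster)."""
    labels = KMeans(n_clusters=n_envs, n_init=10, random_state=seed).fit_predict(features)
    return [np.flatnonzero(labels == e) for e in range(n_envs)]

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))            # stand-in for per-example features
envs = make_environments(X, n_envs=3)   # disjoint subsets covering all 60 examples
```

A training procedure of the kind described would then optimize for patterns that hold in every `envs[e]` subset rather than in the pooled, shuffled data.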