Paper Title

The Value of Out-of-Distribution Data

Authors

Ashwin De Silva, Rahul Ramesh, Carey E. Priebe, Pratik Chaudhari, Joshua T. Vogelstein

Abstract

We expect the generalization error to improve with more samples from a similar task, and to deteriorate with more samples from an out-of-distribution (OOD) task. In this work, we show a counter-intuitive phenomenon: the generalization error of a task can be a non-monotonic function of the number of OOD samples. As the number of OOD samples increases, the generalization error on the target task improves before deteriorating beyond a threshold. In other words, there is value in training on small amounts of OOD data. We use Fisher's Linear Discriminant on synthetic datasets and deep networks on computer vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS and DomainNet to demonstrate and analyze this phenomenon. In the idealistic setting where we know which samples are OOD, we show that these non-monotonic trends can be exploited using an appropriately weighted objective of the target and OOD empirical risk. While its practical utility is limited, this does suggest that if we can detect OOD samples, then there may be ways to benefit from them. When we do not know which samples are OOD, we show how a number of go-to strategies such as data-augmentation, hyper-parameter optimization, and pre-training are not enough to ensure that the target generalization error does not deteriorate with the number of OOD samples in the dataset.
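The "appropriately weighted objective of the target and OOD empirical risk" can be illustrated with a small sketch. This is not the paper's implementation; the linear-regression setting, the variable names, and the toy data below are hypothetical, chosen only to show a convex combination `alpha * R_target + (1 - alpha) * R_OOD` being minimized in closed form for different weights.

```python
import numpy as np

def weighted_empirical_risk(w, X_t, y_t, X_o, y_o, alpha):
    """Convex combination of target and OOD mean-squared risks:
    alpha * R_target(w) + (1 - alpha) * R_OOD(w)."""
    r_t = np.mean((X_t @ w - y_t) ** 2)
    r_o = np.mean((X_o @ w - y_o) ** 2)
    return alpha * r_t + (1 - alpha) * r_o

# Hypothetical toy data: a target task, plus an OOD task with a shifted
# input distribution and a perturbed regression vector.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
X_t = rng.normal(size=(50, 2))
y_t = X_t @ w_true + 0.1 * rng.normal(size=50)
X_o = rng.normal(loc=1.5, size=(20, 2))
y_o = X_o @ (w_true + 0.5) + 0.1 * rng.normal(size=20)

# Sweep the weight: alpha = 1.0 ignores the OOD samples entirely,
# smaller alpha lets them influence the fit.
for alpha in (1.0, 0.9, 0.5):
    # Closed-form minimizer of the weighted least-squares objective.
    A = alpha * X_t.T @ X_t / len(X_t) + (1 - alpha) * X_o.T @ X_o / len(X_o)
    b = alpha * X_t.T @ y_t / len(X_t) + (1 - alpha) * X_o.T @ y_o / len(X_o)
    w_hat = np.linalg.solve(A, b)
    target_risk = weighted_empirical_risk(w_hat, X_t, y_t, X_o, y_o, 1.0)
    print(f"alpha={alpha:.1f}  target risk={target_risk:.4f}")
```

Sweeping `alpha` this way is how one would probe the non-monotonic trend the abstract describes: on held-out target data, an intermediate weight on a small OOD sample can outperform both extremes.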
