Paper Title


Self-training Avoids Using Spurious Features Under Domain Shift

Paper Authors

Yining Chen, Colin Wei, Ananya Kumar, Tengyu Ma

Paper Abstract


In unsupervised domain adaptation, existing theory focuses on situations where the source and target domains are close. In practice, conditional entropy minimization and pseudo-labeling work even when the domain shifts are much larger than those analyzed by existing theory. We identify and analyze one particular setting where the domain shift can be large, but these algorithms provably work: certain spurious features correlate with the label in the source domain but are independent of the label in the target. Our analysis considers linear classification where the spurious features are Gaussian and the non-spurious features are a mixture of log-concave distributions. For this setting, we prove that entropy minimization on unlabeled target data will avoid using the spurious feature if initialized with a decently accurate source classifier, even though the objective is non-convex and contains multiple bad local minima using the spurious features. We verify our theory for spurious domain shift tasks on semi-synthetic Celeb-A and MNIST datasets. Our results suggest that practitioners collect and self-train on large, diverse datasets to reduce biases in classifiers even if labeling is impractical.
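The setting the abstract describes can be illustrated with a small simulation: fit a linear source classifier on data where a spurious Gaussian feature correlates with the label, then minimize conditional entropy on unlabeled target data where that correlation is gone. This is only a toy sketch of the idea; the distributions, learning rates, and step counts below are illustrative assumptions, not the paper's exact construction or experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Source domain: one informative (non-spurious) feature, a mixture of two
# Gaussians, plus one spurious Gaussian feature; both correlate with the
# label y in {-1, +1} in the source.
y_s = rng.integers(0, 2, n) * 2 - 1
X_src = np.column_stack([
    y_s + 0.5 * rng.normal(size=n),     # informative feature
    0.8 * y_s + rng.normal(size=n),     # spurious: correlated in source only
])

# Target domain: the spurious feature is independent of the label.
y_t = rng.integers(0, 2, n) * 2 - 1
X_tgt = np.column_stack([
    y_t + 0.5 * rng.normal(size=n),
    rng.normal(size=n),                 # spurious: no label correlation
])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: fit a linear source classifier (logistic regression by gradient
# descent on labeled source data).
w = np.zeros(2)
for _ in range(500):
    p = sigmoid(X_src @ w)
    w -= 0.5 * X_src.T @ (p - (y_s == 1)) / n
w_src = w.copy()

# Step 2: self-train by minimizing the conditional entropy
# H(p) = -p log p - (1 - p) log(1 - p) on unlabeled target data,
# initialized from the source classifier. For a logistic model,
# log(p / (1 - p)) == z, so dH/dz = -z * p * (1 - p).
for _ in range(2000):
    z = X_tgt @ w
    p = sigmoid(z)
    w -= 1.0 * X_tgt.T @ (-z * p * (1 - p)) / n

# Entropy minimization drives the spurious weight down relative to the
# informative one: |w[1] / w[0]| ends up smaller than |w_src[1] / w_src[0]|.
```

In this toy run the relative spurious weight shrinks during entropy minimization, which is the qualitative behavior the theorem predicts; the paper's semi-synthetic Celeb-A and MNIST experiments test the same effect at scale.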
