Paper Title
Identifying Hard Noise in Long-Tailed Sample Distribution
Paper Authors
Paper Abstract
Conventional de-noising methods rely on the assumption that all samples are independent and identically distributed, so the resultant classifier, though disturbed by noise, can still easily identify the noises as outliers of the training distribution. However, this assumption is unrealistic in large-scale data, which is inevitably long-tailed. Such imbalanced training data makes a classifier less discriminative for the tail classes, whose previously "easy" noises are now turned into "hard" ones -- they are almost as much outliers as the clean tail samples. We introduce this new challenge as Noisy Long-Tailed Classification (NLT). Not surprisingly, we find that most de-noising methods fail to identify the hard noises, resulting in a significant performance drop on the three proposed NLT benchmarks: ImageNet-NLT, Animal10-NLT, and Food101-NLT. To address this, we design an iterative noisy learning framework called Hard-to-Easy (H2E). Our bootstrapping philosophy is to first learn a classifier that serves as a noise identifier invariant to class and context distributional changes, reducing the "hard" noises to "easy" ones, whose removal further improves the invariance. Experimental results show that our H2E outperforms state-of-the-art de-noising methods and their ablations in long-tailed settings while maintaining stable performance in conventional balanced settings. Datasets and code are available at https://github.com/yxymessi/H2E-Framework.
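
To make the iterative Hard-to-Easy loop in the abstract concrete, below is a minimal, heavily simplified Python sketch. It is not the authors' H2E implementation: the class-balanced logistic regression (a crude stand-in for a classifier invariant to class-distribution change), the probability-based noise score, and all names (h2e_style_filter, drop_fraction, etc.) are illustrative assumptions.

# A minimal, runnable sketch of the iterative "hard-to-easy" idea described
# in the abstract, NOT the authors' H2E implementation. The class-balanced
# logistic regression is only a crude stand-in for an invariant classifier,
# and the probability-based noise score is an illustrative assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression

def h2e_style_filter(X, y, rounds=3, drop_fraction=0.1):
    """Alternate between (1) fitting a classifier that is less sensitive
    to class imbalance and (2) removing the samples it finds most
    inconsistent with their labels, so that the next round's classifier
    becomes more invariant and remaining "hard" noise turns "easy"."""
    keep = np.arange(len(y))
    clf = None
    for _ in range(rounds):
        # Step 1: class-balanced training keeps tail classes from being
        # treated as outliers wholesale (a proxy for invariance).
        clf = LogisticRegression(class_weight="balanced", max_iter=1000)
        clf.fit(X[keep], y[keep])
        # Step 2: score each kept sample by the probability the classifier
        # assigns to its (possibly noisy) label; low probability = suspect.
        proba = clf.predict_proba(X[keep])
        label_col = np.searchsorted(clf.classes_, y[keep])
        cleanliness = proba[np.arange(len(keep)), label_col]
        # Step 3: drop the least label-consistent fraction and iterate.
        n_drop = int(drop_fraction * len(keep))
        if n_drop == 0:
            break
        order = np.argsort(cleanliness)  # ascending: noisiest first
        keep = keep[order[n_drop:]]
    return clf, keep

# Toy usage: an imbalanced two-class problem with some flipped labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (900, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 900 + [1] * 100)
flip = rng.choice(len(y), size=50, replace=False)
y[flip] = 1 - y[flip]  # inject label noise
clf, kept = h2e_style_filter(X, y)
print(f"kept {len(kept)} of {len(y)} samples")

The key design point mirrored here is the alternation itself: a more imbalance-robust classifier exposes noise that a vanilla classifier would confuse with clean tail samples, and removing that noise in turn yields a cleaner set for the next round.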