Learning from Training Dynamics: Identifying Mislabeled Data Beyond Manually Designed Features

Paper Title

Learning from Training Dynamics: Identifying Mislabeled Data Beyond Manually Designed Features

Authors

Jia, Qingrui; Li, Xuhong; Yu, Lei; Bian, Jiang; Zhao, Penghao; Li, Shupeng; Xiong, Haoyi; Dou, Dejing

Abstract

Mislabeled or ambiguously labeled samples in the training set can degrade the performance of deep models, so diagnosing the dataset and identifying mislabeled samples helps to improve generalization. Training dynamics, i.e., the traces left by the iterations of an optimization algorithm, have recently been shown to be effective for locating mislabeled samples using hand-crafted features. In this paper, going beyond manually designed features, we introduce a novel learning-based solution: a noise detector, instantiated as an LSTM network, that takes the raw training dynamics as input and learns to predict whether a sample was mislabeled. Specifically, the proposed method trains the noise detector in a supervised manner on a dataset with synthesized label noise, and the trained detector can be applied to various datasets (with either natural or synthesized label noise) without retraining. We conduct extensive experiments to evaluate the proposed method. We train the noise detector on the CIFAR dataset with synthesized label noise and test it on Tiny ImageNet, CUB-200, Caltech-256, WebVision and Clothing1M. Results show that the proposed method precisely detects mislabeled samples on these datasets without further adaptation and outperforms state-of-the-art methods. Further experiments demonstrate that mislabel identification can guide label correction, namely data debugging, providing improvements from the data side that are orthogonal to algorithm-centric state-of-the-art techniques.
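The data-preparation step implied by the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `inject_symmetric_noise` and `to_sequences` are hypothetical, the noise model is assumed to be symmetric (uniform) label flipping, and the recorded signal is assumed to be the per-epoch class-probability vector of each sample. The abstract itself specifies only "synthesized label noise" and "raw training dynamics".

```python
import random


def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip a fraction of labels to a different, uniformly chosen class.

    Returns (noisy_labels, is_mislabeled). The 0/1 flags in is_mislabeled
    are the supervised targets a noise detector could be trained on.
    Symmetric flipping is an assumption; the source does not name the
    noise model.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes to flip a label")
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    rng = random.Random(seed)
    n = len(labels)
    flip_idx = set(rng.sample(range(n), round(noise_rate * n)))
    noisy_labels, is_mislabeled = [], []
    for i, y in enumerate(labels):
        if i in flip_idx:
            # Draw from the other classes so the label really changes.
            others = [c for c in range(num_classes) if c != y]
            noisy_labels.append(rng.choice(others))
            is_mislabeled.append(1)
        else:
            noisy_labels.append(y)
            is_mislabeled.append(0)
    return noisy_labels, is_mislabeled


def to_sequences(per_epoch_probs):
    """Regroup dynamics from [epoch][sample] to [sample][epoch].

    per_epoch_probs[e][i] is the class-probability vector predicted for
    sample i after epoch e (an assumed choice of "raw" signal). Each
    returned item is one sample's time series, the kind of input a
    sequence model such as an LSTM consumes.
    """
    if not per_epoch_probs:
        return []
    num_samples = len(per_epoch_probs[0])
    return [[epoch[i] for epoch in per_epoch_probs] for i in range(num_samples)]
```

Under these assumptions, a detector would be fit on pairs of (sequence from `to_sequences`, flag from `inject_symmetric_noise`) gathered while training a classifier on the noisy labels. At test time, only the sequences from a new dataset are needed, which is what would allow reuse without retraining.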
