Paper Title
Robust and On-the-fly Dataset Denoising for Image Classification
Paper Authors
Paper Abstract
Memorization in over-parameterized neural networks can severely hurt generalization in the presence of mislabeled examples. However, mislabeled examples are hard to avoid in extremely large datasets collected with weak supervision. We address this problem by reasoning counterfactually about the loss distribution of examples with uniform random labels had they been trained alongside the real examples, and we use this information to remove noisy examples from the training set. First, we observe that examples with uniform random labels incur higher losses when trained with stochastic gradient descent under large learning rates. Then, we propose to model the loss distribution of these counterfactual examples using only the network parameters, which captures their behavior with remarkable success. Finally, we propose to remove examples whose loss exceeds a certain quantile of the modeled loss distribution. This leads to On-the-fly Data Denoising (ODD), a simple yet effective algorithm that is robust to mislabeled examples while introducing almost zero computational overhead compared to standard training. ODD achieves state-of-the-art results on a wide range of datasets, including real-world ones such as WebVision and Clothing1M.
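To make the quantile-based filtering step concrete, below is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the paper's exact procedure: the function name `odd_keep_mask` and the default `quantile=0.1` are hypothetical, and the counterfactual loss distribution is sampled empirically by scoring each input under a uniformly random label, whereas the paper models this distribution from the network parameters alone.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def odd_keep_mask(model, loader, quantile=0.1, device="cuda"):
    """Sketch of quantile-based denoising in the style of ODD.

    For every training example we record (a) its loss under its given
    label and (b) its loss under a uniformly random label, the latter
    serving as a sample from the counterfactual loss distribution.
    Examples whose real-label loss exceeds the chosen quantile of the
    counterfactual distribution are flagged for removal.

    NOTE: `quantile=0.1` is an illustrative assumption; the paper's
    threshold and its analytical model of the counterfactual
    distribution differ in detail.
    """
    model.eval()
    real_losses, cf_losses = [], []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        num_classes = logits.size(1)
        # Per-example loss under the given (possibly noisy) labels.
        real_losses.append(F.cross_entropy(logits, y, reduction="none"))
        # Per-example loss under uniform random labels: one empirical
        # sample from the counterfactual loss distribution per input.
        y_rand = torch.randint(num_classes, y.shape, device=device)
        cf_losses.append(F.cross_entropy(logits, y_rand, reduction="none"))
    real_losses = torch.cat(real_losses)
    threshold = torch.quantile(torch.cat(cf_losses), quantile)
    # True = keep the example for subsequent training epochs.
    return real_losses <= threshold
```

Because the mask is computed from losses the network already produces during training, this filtering adds almost no computational overhead, which is the property the abstract highlights.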